Candidate numbers¶
27515, 30384, 30855
Imports¶
!pip install --quiet stable_baselines3 quantstats finta
!pip install --quiet git+https://github.com/ShinyOrbThing/gym-anytrading
!pip install --quiet mplfinance
!pip install --quiet shimmy
!pip install --quiet ta
import quantstats as qs
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import gymnasium as gym
import gym_anytrading
from gym_anytrading.envs import StocksEnv, ForexEnv, TradingEnv, Actions, Positions
from gym import spaces
from stable_baselines3 import A2C, DQN, PPO
from stable_baselines3.common.callbacks import EvalCallback, ProgressBarCallback
from stable_baselines3.common.vec_env import DummyVecEnv, make_vec_env
import shimmy
from finta import TA
from ta.trend import SMAIndicator, EMAIndicator, MACD
from sklearn.preprocessing import StandardScaler
from ta.momentum import RSIIndicator
import statsmodels.api as sm
import matplotlib.dates as mdates
import matplotlib.font_manager as fm
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import datetime
import os
import pickle
from time import time
from enum import Enum
from tqdm import tqdm
from collections import Counter
# ignoring warnings for report
import warnings
warnings.filterwarnings("ignore")
seed=1
from google.colab import drive
# This will prompt you to click on a link, sign in to your Google account, and copy a verification code.
drive.mount('/content/drive', force_remount=True)
Mounted at /content/drive
Deep Reinforcement Learning in Simulated and Real Market Environments¶
Table of Contents¶
Part 1 (Simulation):
Part 2 (Forex):
2.1.1. Custom Environment (upgrading gym_anytrading with new positions and actions)
2.2 PPO and A2C applied to Forex Trading (EUR_USD) with gym_anytrading environment
Abstract¶
This study investigates the efficacy of deep reinforcement learning (DRL) algorithms in trading a single stock within a simplified stock trading environment. In the first half, we explore the performance of Advantage Actor-Critic (A2C), Proximal Policy Optimisation (PPO), and Deep Q-Network (DQN) algorithms in navigating simulated stock signals characterized by sinusoidal waves with varying levels of noise. Our objective is to evaluate the models' resilience to noise and observe how their behavior evolves as the signals approach white noise. Additionally, we compare the performance of these DRL models against a classical benchmark strategy based on mean reversion. In the second part of the project, we extend our analysis to real market data, focusing on minutely data from the foreign exchange trading pair - EUR_USD. This real-world scenario poses greater challenges, so we make modifications to our original environment in an effort to enhance model training and adaptability. Through this comprehensive investigation, we aim to provide insights into the effectiveness of DRL algorithms in practical trading scenarios, shedding light on their strengths, limitations, and potential applications in financial markets.
Introduction¶
Motivation & Literature Review¶
The integration of advanced computational techniques to develop trading strategies has become a key research area in quantitative finance. In recent years, DRL has emerged as a powerful tool, capable of navigating complex, dynamic environments, including financial markets. This study is divided into two parts. First, we investigate the application of DRL algorithms, specifically A2C, PPO and DQN, in a single-stock trading environment. Simulated trading environments offer a controlled setting to test the robustness and effectiveness of trading algorithms under various market conditions. For instance, Nevmyvaka et al., 2006 employed machine learning to optimise execution in electronic trading, using simulated environments to validate performance in idealised versus noisy conditions, providing valuable insights into resilience and adaptability. Our motivation is similar, as we seek to understand the strengths and weaknesses of RL for timing buy and sell executions, but our environment differs in representation: we consider OHLCV price data, while Nevmyvaka et al. focused on limit-order books.
In several recent studies, simulated stock prices have been used for model validation. Just last year, Taherizadeh et al., 2023 explored stock market dynamics using agent-based model (ABM) simulations. Specifically, they developed a simulated stock market, where a large number of agents were deployed to select an optimal portfolio based on several technical criteria. The relative performance of different agents was studied in order to draw inference on which trading strategies are favourable. The mathematical models used to generate policies in Taherizadeh et al. do not use reinforcement learning, but in simulating the stock market, they studied the relative performance of various paradigms. In part 1 of our study, we have similar goals.
The application of RL agents to electronic trading is an emerging research area with promising developments in recent years. Both single-stock trading and portfolio optimisation problems have been tackled using RL. In the late 90's, Moody & Saffell, 1998 demonstrated the effectiveness of RL for trading in practice, outperforming the S&P 500 Stock Index for the 25 year period 1970 through 1994. More recently, Liu et al., 2022 trained RL agents to trade 30 individual stocks, comparing model performance to the Dow Jones Industrial Average and a traditional min-variance portfolio allocation strategy. They focused on the deep deterministic policy gradient (DDPG) algorithm, which is more suited to high dimensional state-action representations (30 stocks, can buy or sell any stock at each time point). In our study, we focus on the single-stock setting, where the objective is to time our buy and sell actions on a single stock to maximise return. Also, we consider mean-reversion trading strategy as our traditional benchmark rather than min-variance which is more suited for portfolio allocation. In Section 1.4.4, we explain why the mean reversion benchmark is effectively the performance upper bound when the price is generated by a deterministic, periodic price signal.
Usha et al., 2019 combined simulated and real data for forex and commodity trading. The outcomes indicated that the trading agent, trained on simulated data and tested on real forex data, outperformed the market by generating higher returns and minimising losses, particularly in bear markets. While the model excelled in capturing short-term trends and executing profitable trades, it showed limitations in identifying long-term trends. In this project, we investigate the consistency of strategies for models trained on real market data, complementing the simulated-data experiments.
Regarding Deep Reinforcement Learning research for real market trading, various studies have focused on the DQN method specifically, with results of varying success. Part of this variation in outcomes depends on the specific neural network architectures applied. Chen & Gao, 2019 showcase the impact of the number of layers in the Deep Q-network (DQN) on automated stock trading strategies, highlighting how the network's depth influences the trading strategy. The study reveals that DQN can effectively learn profitable patterns from stock trading data and achieve high accumulated rewards. Additionally, introducing recurrence into the Q-learning process (DRQN) may lead to improvements in stock trading execution. These findings underscore the importance of network architecture in enhancing the effectiveness of automated stock trading systems. That is why, in this project, we experiment with neural network architecture design and its connection to trading strategies. Below we outline the specific research questions tackled throughout the project.
Research Questions¶
We aimed to address the following research questions in this project.
Part 1 (Simulations):
- How do the algorithms compare in terms of profitability when trading on a periodic, deterministic price movement?
- Furthermore, as we inject noise into the signal, how do the model rankings change?
- How well do the RL models generalise to new time periods, with the same data generating process?
Part 2 (Forex):
- Is there a relationship between the trading strategy chosen and the DQN model architecture? (positions and actions considered to depict trading strategy)
- Does the number of training steps for the model influence the strategies learnt by DQN?
- How well are the PPO and A2C models able to generalise when trained and tested on periods of significantly different trading activity within one year?
1.1 Trading Environment¶
Our trading environment is based on gym-anytrading, an extension of the OpenAI Gym library. We made changes to the environment as required for our experiments. This is a simplified version of a real stock environment with fewer degrees of freedom (more on this below).
The gym-anytrading trading environment provides a simplified framework for evaluating trading strategies using RL algorithms. Within this environment, two key positions are defined: $\text{Long}$, indicating ownership of shares in the asset, and $\text{OOM}$ (Out-of-market), indicating absence of ownership. The original gym-anytrading documentation instead names the positions Long=1 and Short=0, but from the source code we saw that there is no real Short-position functionality. We raised this issue with the developers here, and we proceed with the trading positions as redefined above.
State Representation¶
- Current price $P_t$,
- First difference of the current price: $D_t = P_t - P_{t-1}$,
- Position at time $t$: $X_t = \begin{cases} 1 & \text{if Long,} \\ 0 & \text{if OOM.} \end{cases}$
- % Profit at time $t$: $Y_t$, the cumulative fee-adjusted profit of all trades closed up to time $t$ (its update rule is described below).
Although trading indicators were considered, we found them ineffective for training RL models on the simulated sinusoidal signals, so we excluded them. For the Mean Reversion Benchmark, however, a 2-month moving average is included:

$$\text{MA}_t = \frac{1}{2r}\sum_{i=t-2r+1}^{t} P_i$$

where $r$ is the period of the simulated sine wave, set to $31$.
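As a minimal sketch of this benchmark input (using an illustrative, noiseless sine-wave price series rather than the simulated data), the 2-month moving average is simply a rolling mean over $2r$ observations:

```python
import numpy as np
import pandas as pd

period = 31  # period r of the simulated sine wave

# illustrative price series: centre-line 100, amplitude 10
prices = pd.Series(100 + 10 * np.sin(np.arange(200) * 2 * np.pi / period))

# 2-month moving average: rolling mean over 2r observations
ma = prices.rolling(window=2 * period).mean()

# the first 2r - 1 values are NaN until the window fills
print(ma.isna().sum())  # 61
```

Since the window spans exactly two full periods, the moving average of the noiseless wave settles at the centre-line, which is what makes crossing it a natural mean-reversion signal.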
Actions¶
$A_t$, the action at time $t$, can take two values:
- $\text{Buy}=1$
- $\text{Sell}=0$
Notably, consecutive Buy or Sell actions do not trigger trades; instead, two consecutive Buy actions correspond to a buy-and-hold strategy for the subsequent time point.
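The mapping from action sequences to trades can be sketched as follows (a standalone illustration, not the environment's code): a trade is triggered only when the action differs from the current position, so repeated Buys simply hold.

```python
# 1 = Buy / Long, 0 = Sell / OOM (out of market)
def count_trades(actions, start_position=0):
    """Count trades triggered by an action sequence.

    A trade occurs only when the action flips the current position;
    repeating the same action is a hold."""
    position = start_position
    trades = 0
    for a in actions:
        if a != position:    # Buy while OOM, or Sell while Long
            trades += 1
            position = a     # position flips
    return trades

# Buy, hold, hold, Sell, hold -> only two trades (one open, one close)
print(count_trades([1, 1, 1, 0, 0]))  # 2
```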
Reward Update¶
Let $R_t$ denote the reward at time $t$. Then

$$R_t = \begin{cases} R_{t-1} + (P_t - P_{t-s}) & \text{if } A_t = 0 \text{ and } X_t = 1, \\ R_{t-1} & \text{otherwise,} \end{cases}$$

where $R_{t-1}$ is the previous reward value, $P_t$ is the current price, and $P_{t-s}$ is the price at which we bought the shares, $s$ time steps ago. Given these conditions, the reward function only updates when we close a position (sell held shares).
Difference between % Profit and the Reward function¶
The reward function is essentially the difference between the selling and buying price of a share. The profit metric, by contrast, is expressed as a percentage: the current % profit multiplied by the % price change, taking trading fees into account. This approach assumes an "all-in" strategy, in which the maximum possible number of shares is purchased with each buy decision. As a result, the reward value and profit are not directly proportional.

For our experiments, we compute performance metrics from the % profit values rather than the reward function values, which are used solely for training the RL models. The all-in assumption significantly reduces the action-space dimensions in the trading environment, making the models easier to train under our compute constraints.
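The all-in profit update can be sketched with illustrative prices (mirroring the `_update_profit` formula used in the environment, with its 1% and 0.5% trade fees):

```python
TRADE_FEE_BID_PERCENT = 0.01   # fee applied on the sell side of the update
TRADE_FEE_ASK_PERCENT = 0.005  # fee applied on the buy side of the update

def update_profit(total_profit, buy_price, sell_price):
    """All-in profit update when closing a long position:
    convert the current % profit into shares at the buy price,
    then back into % profit at the sell price, net of fees."""
    shares = (total_profit * (1 - TRADE_FEE_ASK_PERCENT)) / buy_price
    return shares * (1 - TRADE_FEE_BID_PERCENT) * sell_price

# buy at 100, sell at 110, starting from unit profit 1.0:
# the 10% gross gain shrinks to roughly 8.36% after fees
p = update_profit(1.0, buy_price=100, sell_price=110)
print(p)  # ~1.0836
```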
1.2 Simulated Price Data¶
We use simulated sine waves to represent stock prices, chosen for their periodic, predictable pattern. Our hypothesis was that the models would learn the optimal policy for a noiseless sine wave fairly easily, and that performance would degrade as noise is added. Studying how behaviour changes as the signal becomes more difficult was a particular area of interest for us; we discuss this further in the analysis.
We simulate sinusoidal prices according to the following equation:
$$P_t = \alpha + \beta \times \sin(m \times t) + \epsilon_t$$

where $\alpha$ is the centre-line (the price when $\sin(m \times t) = 0$), $\beta$ is the amplitude of the sine wave, and $m$ is the number of months in the dataset. $\epsilon_t \sim N(0, \sigma^2)$ is a Gaussian noise term with standard deviation $\sigma$; this is the parameter we vary between experiments, so that the signal becomes increasingly hard to predict.
# Simulations are formatted as daily Open, High, Low, Close prices and Volume (OHLCV) data
TRAIN_START_DATE = '2017-01-01'
TRAIN_END_DATE = '2017-12-31'
TRADE_START_DATE = '2018-01-01'
TRADE_END_DATE = '2018-12-31'
np.random.seed(seed)
# Function to simulate sinusoidal stock data
def simulate_sinusoidal_stock_data(ticker, start_date=TRAIN_START_DATE,
end_date=TRADE_END_DATE, frequency='B', amplitude=20,
rpm=1, base_price=100,
volume_mean=1000, volume_std=100, noise=0,
trend=0, indicators=False, window_size=5, seed=seed):
'''
Simulates sinusoidal stock price data with optional noise. Volume is not used,
and is only generated to fulfil the OHLCV (Open, High, Low, Close, Volume) data format
Parameters:
- ticker (str): Ticker symbol of the stock.
- frequency (str, optional): Frequency of data points. Defaults to 'B' (business day).
- amplitude (float, optional): Amplitude of the sinusoidal function. Defaults to 20.
- rpm (int, optional): Wave revolutions per month. Defaults to 1.
- base_price (float, optional): Initial price of the stock. Defaults to 100.
- noise (float, optional): Standard deviation of Gaussian noise to add to the price data. Defaults to 0.
- trend (float, optional): Linear trend added to the price data. Defaults to 0.
- indicators (bool, optional): Whether to include additional technical indicators. Defaults to False.
- window_size: window size specification for rolling window indicators
- seed (int, optional): Seed for random number generation.
Note that volume is not contextual in this implementation. It is included
for data formatting in the stock environment.
'''
if seed:
np.random.seed(seed)
date_range = pd.date_range(start=start_date, end=end_date, freq=frequency)
num_days = len(date_range)
num_months = len(pd.date_range(start=start_date, end=end_date, freq='M'))
# simulate time variable
time = np.linspace(0, 2 * np.pi, num_days)
# we want num_months revolutions in the sine wave, so period = num_months*rpm
prices = base_price + amplitude * np.sin(num_months * rpm * time) + np.random.normal(0, noise, num_days) + time*trend
# simulate volume (not used by our environment)
volume = np.random.normal(volume_mean, volume_std, num_days).astype(int)
# Create a DataFrame
sim_df = pd.DataFrame({
'Date': date_range,
'Open': prices,
'High': prices + np.random.uniform(0, amplitude, num_days),
'Low': prices - np.random.uniform(0, amplitude, num_days),
'Close': prices,
'Volume': volume
})
sim_df.set_index('Date', inplace=True)
# coerce data into correct format
sim_df['tic'] = ticker.lower()
sim_df['Date'] = sim_df.index
sim_df['day'] = sim_df.Date.dt.dayofweek
#sim_df.date = sim_df.Date.dt.strftime('%Y-%m-%d')
sim_df = sim_df.sort_index()
# including indicators shortens the series slightly
if indicators:
sim_df['SMA'] = TA.SMA(sim_df, window_size)
sim_df[['MACD', 'MACD_SIG']] = TA.MACD(sim_df, window_size)
sim_df['RSI'] = TA.RSI(sim_df)
sim_df = sim_df.iloc[window_size+5:, :]
return sim_df
# Define parameters
ticker_symbol = 'SINE'
start_date = TRAIN_START_DATE
end_date = TRADE_END_DATE
| Date | Open | High | Low | Close | Volume | tic | Date | day | SMA | MACD | MACD_SIG | RSI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2017-01-16 | 104.786313 | 113.799179 | 102.676172 | 104.786313 | 919 | sine | 2017-01-16 | 0 | 113.442789 | -1.625462 | 1.149142 | 45.079044 |
| 2017-01-17 | 99.033732 | 105.722120 | 87.929226 | 99.033732 | 974 | sine | 2017-01-17 | 1 | 109.306631 | -3.999533 | 0.043423 | 36.215511 |
| 2017-01-18 | 93.361843 | 97.721222 | 91.462722 | 93.361843 | 1038 | sine | 2017-01-18 | 2 | 104.393292 | -6.515011 | -1.344570 | 29.960466 |
| 2017-01-19 | 88.244295 | 106.759791 | 76.624614 | 88.244295 | 971 | sine | 2017-01-19 | 3 | 99.113076 | -8.943269 | -2.934223 | 25.654944 |
| 2017-01-20 | 84.108446 | 97.901234 | 79.048844 | 84.108446 | 960 | sine | 2017-01-20 | 4 | 93.906926 | -11.048759 | -4.616314 | 22.802909 |
# Generate simulated data
%matplotlib inline
np.random.seed(seed)
sigma_vec = np.arange(0, 21, 4)
sine_data_dict = {}
for noise in sigma_vec:
sine_data = simulate_sinusoidal_stock_data(ticker_symbol, start_date, end_date, noise=noise, amplitude=10, indicators=False)
sine_data_dict[noise] = sine_data
sigmas = sorted(sine_data_dict.keys())
sigmas
[0, 4, 8, 12, 16, 20]
Train Test Split¶
Each full simulation consists of 521 data points, each representing a daily reported price. The dataset is divided as follows.
Training:
The first 50% of the data is used for training the RL models. This period is not considered in our evaluation.
Validation:
The next 25% is used for our evaluation callback (more on this later).
Testing:
The last 25% is used as the out-of-time test period, from which we evaluate how the model performs on a new, unseen time period.
train_start_date = "2017-01-02"
valid_end_date = "2018-06-29"
df = sine_data_dict[0]
train_val_df = df.loc[(df.index >= train_start_date) & (df.index <= valid_end_date), :]
train_size = round(train_val_df.shape[0]*0.66)
val_size = train_val_df.shape[0] - train_size
test_size = df.shape[0] - train_val_df.shape[0]
print("Split sizes:", train_size, val_size, test_size)
Split sizes: 257 133 131
Simulations¶
Here we plot a sample of prices from each simulation. The noise parameter $\sigma$ is increased at intervals of 4 in the range 0 to 20.
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))
for i, ax in enumerate(axes.flatten()):
df = sine_data_dict[sigmas[i]].copy()
ax.plot(df.loc[(df.index > '2017-01-01') & (df.index < '2017-07-01'), 'Close'])
ax.set_title(f'$\\sigma = {sigmas[i]}$')  # escape backslash to avoid invalid-escape warning
# Set major ticks format and interval
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=2))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
plt.tight_layout()
plt.show()
ACF structure of simulations¶
Below are plots of the Autocorrelation Function (ACF) for each simulated stock. The noiseless sine wave is deterministic, as is reflected by the correlation of 1 at regular lag intervals.
As we increase the noise parameter, the ACF structure gets flatter as price approaches a white noise profile.
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))
for i, ax in enumerate(axes.flatten()):
# Calculate and plot ACF
sm.graphics.tsa.plot_acf(sine_data_dict[sigmas[i]].Close, lags=40, ax=ax)
ax.set_title(f'ACF for $\\sigma = {sigmas[i]}$')
plt.tight_layout()
plt.show()
1.3 Environment Setup¶
# Note: Short position is actually "Out-Of-Market" as described above
class Actions(Enum):
Sell = 0
Buy = 1
class Positions(Enum):
Short = 0
Long = 1
def opposite(self):
return Positions.Short if self == Positions.Long else Positions.Long
class TradingEnv(gym.Env):
metadata = {'render_modes': ['human'], 'render_fps': 3}
def __init__(self, df, window_size, render_mode=None):
assert df.ndim == 2
assert render_mode is None or render_mode in self.metadata['render_modes']
self.render_mode = render_mode
self.df = df
self.window_size = window_size
self.prices, self.signal_features = self._process_data()
self.shape = (window_size, self.signal_features.shape[1])
# spaces
self.action_space = gym.spaces.Discrete(len(Actions))
INF = 1e10
self.observation_space = gym.spaces.Box(
low=-INF, high=INF, shape=self.shape, dtype=np.float32,
)
# episode
self._start_tick = self.window_size
self._end_tick = len(self.prices) - 1
self._truncated = None
self._current_tick = None
self._last_trade_tick = None
self._position = None
self._position_history = None
self._total_reward = None
self._total_profit = None
self._first_rendering = None
self.history = None
def reset(self, seed=None, options=None):
super().reset(seed=seed, options=options)
self.action_space.seed(int((self.np_random.uniform(0, seed if seed is not None else 1))))
self._truncated = False
self._current_tick = self._start_tick
self._last_trade_tick = self._current_tick - 1
self._position = Positions.Short # initialises in Short position automatically
self._position_history = (self.window_size * [None]) + [self._position]
self._total_reward = 0.
self._total_profit = 1. # unit
self._first_rendering = True
self.history = {}
observation = self._get_observation()
info = self._get_info()
if self.render_mode == 'human':
self._render_frame()
return observation, info
def step(self, action):
self._truncated = False
self._current_tick += 1
if self._current_tick == self._end_tick:
self._truncated = True
step_reward = self._calculate_reward(action)
self._total_reward += step_reward
self._update_profit(action)
trade = False
if (
(action == Actions.Buy.value and self._position == Positions.Short) or
(action == Actions.Sell.value and self._position == Positions.Long)
):
trade = True
if trade:
self._position = self._position.opposite()
self._last_trade_tick = self._current_tick
self._position_history.append(self._position)
observation = self._get_observation()
info = self._get_info()
self._update_history(info)
if self.render_mode == 'human':
self._render_frame()
return observation, step_reward, False, self._truncated, info
def _get_info(self):
return dict(
total_reward=self._total_reward,
total_profit=self._total_profit,
position=self._position
)
def _get_observation(self):
return self.signal_features[(self._current_tick-self.window_size+1):self._current_tick+1]
def _update_history(self, info):
if not self.history:
self.history = {key: [] for key in info.keys()}
for key, value in info.items():
self.history[key].append(value)
def _render_frame(self):
self.render()
def render(self, mode='human'):
def _plot_position(position, tick):
color = None
if position == Positions.Short:
color = 'red'
elif position == Positions.Long:
color = 'green'
if color:
plt.scatter(tick, self.prices[tick], color=color)
start_time = time()
if self._first_rendering:
self._first_rendering = False
plt.cla()
plt.plot(self.prices)
start_position = self._position_history[self._start_tick]
_plot_position(start_position, self._start_tick)
_plot_position(self._position, self._current_tick)
plt.suptitle(
"Total Reward: %.6f" % self._total_reward + ' ~ ' +
"Total Profit: %.6f" % self._total_profit
)
end_time = time()
process_time = end_time - start_time
pause_time = (1 / self.metadata['render_fps']) - process_time
assert pause_time > 0., "High FPS! Try to reduce the 'render_fps' value."
plt.pause(pause_time)
def render_all(self, title=None):
window_ticks = np.arange(len(self._position_history))
plt.plot(self.prices)
short_ticks = []
long_ticks = []
for i, tick in enumerate(window_ticks):
if self._position_history[i] == Positions.Short:
short_ticks.append(tick)
elif self._position_history[i] == Positions.Long:
long_ticks.append(tick)
plt.plot(short_ticks, self.prices[short_ticks], 'ro')
plt.plot(long_ticks, self.prices[long_ticks], 'go')
if title:
plt.title(title)
plt.suptitle(
"Total Reward: %.6f" % self._total_reward + ' ~ ' +
"Total Profit: %.6f" % self._total_profit
)
def render_all_pretty(self, title=None):
window_ticks = np.arange(len(self._position_history))
plt.figure(figsize=(12, 6))
# plot prices with a more subtle line color and width for dashboard aesthetics
plt.plot(window_ticks, self.prices, color='#1f77b4', linewidth=2, label='Price', zorder=1)
# using 'v' for short (downward pointing triangle) and '^' for long (upward pointing triangle)
short_ticks = [tick for i, tick in enumerate(window_ticks) if self._position_history[i] == Positions.Short]
long_ticks = [tick for i, tick in enumerate(window_ticks) if self._position_history[i] == Positions.Long]
plt.scatter(short_ticks, np.array(self.prices)[short_ticks], color='red', marker='v', s=100, label='Short Position', zorder=4)
plt.scatter(long_ticks, np.array(self.prices)[long_ticks], color='green', marker='^', s=100, label='Long Position', zorder=5)
# title and subtitles with improved layout
if title:
plt.title(title, fontsize=16, fontweight='bold')
#plt.suptitle("Trading Dashboard", fontsize=18, fontweight='bold')
plt.title(
"Total Reward: %.6f" % self._total_reward + ' ~ ' +
"Total Profit: %.6f" % self._total_profit,
loc='left', fontsize=12, style='italic'
)
#plt.legend(frameon=True, facecolor='white', framealpha=0.8, fontsize=10)
plt.xlabel('Time', fontsize=14, fontweight='bold')
plt.ylabel('Price', fontsize=14, fontweight='bold')
plt.grid(color='gray', linestyle='--', linewidth=0.5, alpha=0.7)
plt.gca().set_facecolor('whitesmoke')
plt.draw()
plt.tight_layout()
def close(self):
plt.close()
def save_rendering(self, filepath):
plt.savefig(filepath)
def pause_rendering(self):
plt.show()
def _process_data(self):
raise NotImplementedError
def _calculate_reward(self, action):
raise NotImplementedError
def _update_profit(self, action):
raise NotImplementedError
def max_possible_profit(self): # trade fees are ignored
raise NotImplementedError
class StocksEnv(TradingEnv):
def __init__(self, df, window_size, frame_bound, render_mode=None):
assert len(frame_bound) == 2
self.frame_bound = frame_bound
super().__init__(df, window_size, render_mode)
self.trade_fee_bid_percent = 0.01 # unit (buying fees)
self.trade_fee_ask_percent = 0.005 # unit (selling fees)
def _process_data(self):
prices = self.df.loc[:, 'Close'].to_numpy() # use close price as current share price
prices[self.frame_bound[0] - self.window_size] # validate index (TODO: Improve validation)
prices = prices[self.frame_bound[0]-self.window_size:self.frame_bound[1]]
diff = np.insert(np.diff(prices), 0, 0) # price difference added to features for model training
signal_features = np.column_stack((prices, diff)) # we will include indicators by overriding this function.
return prices.astype(np.float32), signal_features.astype(np.float32)
def _calculate_reward(self, action):
''' This reward function only considers long positions,
Long position reward calculated as expected. Reward isn't applied to long position until a sell action'''
step_reward = 0
trade = False
if (
(action == Actions.Buy.value and self._position == Positions.Short) or
(action == Actions.Sell.value and self._position == Positions.Long)
):
trade = True
if trade:
current_price = self.prices[self._current_tick]
last_trade_price = self.prices[self._last_trade_tick]
price_diff = current_price - last_trade_price # reward calc
if self._position == Positions.Long:
step_reward += price_diff # reward updates when closing long position
return step_reward
def _update_profit(self, action):
trade = False
if (
(action == Actions.Buy.value and self._position == Positions.Short) or
(action == Actions.Sell.value and self._position == Positions.Long)
):
trade = True
# update profit when closing long position
if trade or self._truncated:
current_price = self.prices[self._current_tick]
last_trade_price = self.prices[self._last_trade_tick]
# profit calculated as current profit * % price change, taking into account fees
if self._position == Positions.Long:
shares = (self._total_profit * (1 - self.trade_fee_ask_percent)) / last_trade_price
self._total_profit = (shares * (1 - self.trade_fee_bid_percent)) * current_price
# overriding gym-anytrading data processing for environment setup
def my_process_data(env):
prices = env.df.loc[:, 'Close'].to_numpy()
prices = prices[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
diff = np.insert(np.diff(prices), 0, 0) # include first difference as a feature
# additional indicators (left out by default)
try:
sma = env.df.loc[:, 'SMA'].to_numpy()
macd = env.df.loc[:, 'MACD_SIG'].to_numpy()
rsi = env.df.loc[:, 'RSI'].to_numpy()
sma = sma[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
macd = macd[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
rsi = rsi[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
signal_features = np.column_stack((prices, diff, sma, macd, rsi))
except KeyError:
print("(SMA, MACD, RSI) weren't all present. Reverting to default signal features (price and price-diff)")
signal_features = np.column_stack((prices, diff))
return prices.astype(np.float32), signal_features.astype(np.float32)
class MyStocksEnv(StocksEnv):
_process_data = my_process_data
# environment setup for the Mean Reversion Benchmark
def my_process_data_bm(env):
prices = env.df.loc[:, 'Close'].to_numpy()
prices = prices[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
diff = np.insert(np.diff(prices), 0, 0)
# include 2-month MA for mean reversion policy
sma = env.df['Close'].rolling(window=round(period*2)).mean().to_numpy()
sma = sma[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
signal_features = np.column_stack((prices, diff, sma))
return prices.astype(np.float32), signal_features.astype(np.float32)
class MyStocksEnvBM(StocksEnv):
_process_data = my_process_data_bm
Training & Evaluation Functions¶
Functions used for sequential model training and evaluation
def build_agent(df, window_size,
train_timesteps, algo,
bm_path, bm_eval_path,
seed=seed,
eval_freq=1000,
custom_model_params = False,
model_params={}):
"""
Builds and trains a reinforcement learning agent for trading using the specified algorithm.
Parameters:
- df (DataFrame): DataFrame containing stock data.
- window_size (int): Size of the observation window.
- train_timesteps (int): Total number of training timesteps.
- algo (str): Algorithm to use for training. Options: 'dqn', 'ppo', 'a2c'.
- bm_path (str): Path to save the best model.
- bm_eval_path (str, optional): Path to store reward metrics on validation set.
- seed (int, optional): Random seed for reproducibility.
Returns:
- Model: Trained reinforcement learning model.
"""
# create environment
start_index = window_size
end_index = round(len(df)*0.66)
env = MyStocksEnv(df=df,
window_size=window_size,
frame_bound=(start_index, end_index))
# using evaluation callback to retain the best model
eval_start_index = end_index-1
eval_end_index = len(df)
eval_env = MyStocksEnv(df=df,
window_size=window_size,
frame_bound=(eval_start_index, eval_end_index))
eval_callback = EvalCallback(eval_env, best_model_save_path=bm_path,
log_path=bm_eval_path, eval_freq=eval_freq,
deterministic=True, render=False)
env.reset(seed=seed)
print('observation_space:', env.observation_space)
if algo.lower() == 'dqn':
model = DQN('MlpPolicy', env, seed=seed)
elif algo.lower() == 'ppo':
model = PPO('MlpPolicy', env, seed=seed)
elif algo.lower() == 'a2c':
model = A2C('MlpPolicy', env, seed=seed)
else:
raise ValueError("algo must be one of 'a2c', 'ppo', 'dqn'")
model.learn(total_timesteps=train_timesteps, callback=eval_callback, **model_params)
print(f'Training performance logged to {bm_eval_path}')
print('Returning best model')
if algo.lower() == 'dqn':
bm = DQN.load(os.path.join(bm_path, 'best_model.zip'))
elif algo.lower() == 'ppo':
bm = PPO.load(os.path.join(bm_path, 'best_model.zip'))
elif algo.lower() == 'a2c':
bm = A2C.load(os.path.join(bm_path, 'best_model.zip'))
return bm
def evaluate_agent(df, model, algo, sigma, window_size, seed=seed, save=False, fig_savepath='', metrics_savepath=''):
'''
Evaluate RL agent using stock data from specified df.
Dataframe must be in the same form as the one used to train the agent.
Parameters:
- df (DataFrame): DataFrame containing stock data.
- model: Trained reinforcement learning agent.
- algo (str): Name of the algorithm (used to label results).
- sigma: Noise level of the simulated signal (used to label results).
- window_size (int): Size of the observation window.
- seed (int): Random seed for reproducibility.
- save (bool): If True, save the episode plot and metrics to the paths below.
- fig_savepath (str): Path for the episode plot.
- metrics_savepath (str): Path for the pickled metrics dict.
Returns:
- dict: Dictionary of evaluation metrics, plus the metadata needed to build the results dataframe.
'''
# Create environment
start_index = window_size
end_index = len(df)
env = MyStocksEnv(df=df,
window_size=window_size,
frame_bound=(start_index, end_index))
action_stats = {Actions.Sell: 0, Actions.Buy: 0}
observation, info = env.reset(seed=seed)
while True:
action, _states = model.predict(observation)
action_stats[Actions(action)] += 1
observation, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
if done:
break
env.close()
# computing metrics
qs.extend_pandas()
net_worth = pd.Series(env.unwrapped.history['total_profit'], index=df.index[start_index+1:end_index])
returns = net_worth.pct_change().iloc[1:]
metrics = {}
metrics['algorithm'] = algo
metrics['sigma'] = sigma
metrics['profit_ts'] = net_worth
metrics['cumulative_return'] = net_worth.iloc[-1] - net_worth.iloc[0]
metrics['sharpe'] = qs.stats.sharpe(returns)
metrics['daily_return'] = qs.stats.expected_return(returns)
metrics['pct_in_market'] = np.round(action_stats[Actions.Buy] /(action_stats[Actions.Buy]+action_stats[Actions.Sell]), 2) # % time in market
metrics['max_drawdown'] = qs.stats.max_drawdown(returns)
print('action_stats:', action_stats)
print('info:', info)
if save:
env.unwrapped.render_all_pretty()
plt.savefig(fig_savepath)
print(f"Plot of episode saved to {fig_savepath}")
with open(metrics_savepath, 'wb') as file:
pickle.dump(metrics, file)
print(f"Metrics saved to {metrics_savepath}")
return metrics
1.4.1 Algorithms¶
Here we briefly describe the reinforcement learning algorithms used to train the trading agents. Let $s_t$ and $a_t$ denote the state and action at time $t$ respectively, and let $R_t$ denote the reward at time $t$.
Proximal Policy Optimisation (PPO)¶
PPO optimises a surrogate objective that keeps each policy update small, improving stability and preventing the policy from changing too drastically in a single update step. This is achieved by clipping the policy probability ratio so that it stays within a predefined range around 1, which makes training more stable than directly optimising the expected return. The objective function is defined as:
$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t) \right] $$Where $r_t(\theta)$ is the probability ratio, $\hat{A}_t$ is the advantage estimate at time $t$, and $\epsilon$ is a small constant, determining how much the ratio can deviate from 1. The policy probability ratio is calculated by dividing the probability under the new policy by the probability under the old one. The ratio helps in assessing how much the new policy deviates from the previous one.
$$ r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$$PPO uses the advantage function, calculated as $\hat{A}_t = R_t - V(s_t)$ to measure how much better or worse an action is compared to the policy's average. This helps in effectively scaling the policy updates. Schulman et al., 2017 outlines the PPO algorithm in more detail.
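As an illustrative sketch (with a hypothetical helper name, not part of Stable-Baselines3's internals), the clipped surrogate objective above can be computed in NumPy as:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP (to be maximised)."""
    ratio = np.exp(logp_new - logp_old)                 # r_t(theta)
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)  # keep ratio near 1
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))

# With identical old/new policies the ratio is 1, so the objective
# reduces to the mean advantage.
adv = np.array([1.0, -0.5, 2.0])
logp = np.log(np.array([0.3, 0.5, 0.2]))
print(ppo_clip_objective(logp, logp, adv))  # mean(adv)
```

The `min` with the clipped term means large deviations of the ratio from 1 cannot increase the objective, which is what bounds the effective update size.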
Advantage Actor-Critic (A2C)¶
A2C optimizes both a policy function (actor) and a value function (critic) concurrently. It is designed to balance learning efficiency and stability using multiple agents that operate in parallel environments. The actor determines actions based on the policy $\pi(a|s, \theta)$, while the critic estimates the value of states via the value function $V(s, \phi)$, where $\phi$ denotes the critic network's parameters. The advantage function is used to refine the actor's policy; in A2C it is calculated as $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$.
The Critic is updated using Temporal Difference learning, specifically the TD error $\delta_t$, calculated as
$$\delta_t = R_t + \gamma V(s_{t+1}, \phi) - V(s_t, \phi)$$The Actor is updated according to the policy gradient, adjusted by estimates from the advantage function.
$$\nabla_\theta J \approx \mathbb{E}\left[\nabla_\theta \log \pi(a_t|s_t, \theta) A(s_t, a_t)\right] $$Although A2C is more difficult to train than PPO on simple tasks (OpenRL Benchmark), it is often more effective in complex environments. See Mnih et al., 2016 for a more detailed explanation of the A2C algorithm.
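A minimal numeric sketch of the TD error above (illustrative values; `td_error` is a hypothetical helper):

```python
def td_error(reward, v_next, v_curr, gamma=0.99):
    """TD error: delta_t = R_t + gamma * V(s_{t+1}) - V(s_t)."""
    return reward + gamma * v_next - v_curr

# Critic values for the current and next state (illustrative numbers).
# A positive delta means the transition was better than the critic expected,
# so the actor's log-probability of the taken action is pushed up.
delta = td_error(reward=1.0, v_next=2.0, v_curr=1.5, gamma=0.9)
print(delta)  # 1.0 + 0.9*2.0 - 1.5 ≈ 1.3
```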
Deep Q-network (DQN)¶
DQN combines Q-learning with deep neural networks to handle complex, high-dimensional environments. DQN uses a neural network to approximate the Q-value function, which represents the value of taking an action in a given state. It introduces several changes over traditional Q-learning: the Q-network, experience replay, and the target network. The algorithm was introduced by DeepMind in 2013 (Mnih et al., 2013).
- The Q-value function $Q(s, a; \theta)$ is approximated using the Q-network, which is often a feed-forward neural network, with $\theta$ denoting the parameters of the network.
- Experience Replay is used to break the correlation between consecutive samples and improve training stability. DQN stores agent experiences ($s_t$, $a_t$, $r_t$, $s_{t+1}$) in a replay buffer and samples randomly from this buffer to perform updates.
- A separate, slowly updated target network provides the target Q-values during training, which helps stabilise learning. It is identical to the Q-network but has parameters $\theta^-$ that lag behind the Q-network's parameters.
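The replay mechanism can be sketched as a fixed-size buffer with uniform sampling (a simplified illustration, not Stable-Baselines3's implementation; `ReplayBuffer` is a hypothetical name):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next) transitions, sampled
    uniformly to break the correlation between consecutive experiences."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```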
Learning: The agent interacts with the environment, and stores experiences ($s_t$, $a_t$, $r_t$, $s_{t+1}$) in the replay buffer. Mini-batches of experiences are sampled from the buffer, which are used to calculate the loss function. The loss function used to train the network is based on the mean-squared error between the current Q-values and target Q-values:
$$L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right] $$Q-network parameters are updated using gradient descent. Every $T$ timesteps (for large $T$), the target network is updated with the parameters of the Q-network.
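As a sketch of the target term in the loss above (illustrative, with a hypothetical `dqn_targets` helper; a `dones` mask stops bootstrapping at episode ends):

```python
import numpy as np

def dqn_targets(rewards, q_next_target, dones, gamma=0.99):
    """y = r + gamma * max_a' Q(s', a'; theta^-), masked at terminal states."""
    return rewards + gamma * (1.0 - dones) * q_next_target.max(axis=1)

q_next = np.array([[0.5, 1.0],   # target-network Q-values for s'
                   [2.0, 0.0]])
rewards = np.array([1.0, -1.0])
dones = np.array([0.0, 1.0])     # second transition ends the episode
print(dqn_targets(rewards, q_next, dones, gamma=0.9))  # [1.9, -1.0]
```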
1.4.2 Training RL Agents¶
Here we train A2C, PPO, and DQN agents on each of the simulated stock signals.
Models are then evaluated on the out-of-time test set to assess performance on unseen price movements. Note that all model parameters are kept consistent across simulations.
metrics = []
proj_path='/content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/'
window_size = 10
train_timesteps = 140000
algorithms=['ppo', 'a2c', 'dqn']
train_start_date = "2017-01-02"
valid_end_date = "2018-06-29"
seed = 1
results_table_path = os.path.join(proj_path, 'metrics/tables/metrics_table.csv')
profit_curve_path = os.path.join(proj_path, 'metrics/tables/profit_ts.csv')
val_metrics_path = os.path.join(proj_path, 'metrics/tables/val_metrics.csv')
skip = True
if not skip:
for algo in algorithms:
for i, sigma in enumerate(sigmas):
print(f"Building {algo} for sigma = {sigma}")
print("-"*150)
# directories for storing results
fig_path = os.path.join(proj_path, 'plots/'+algo+'_sine'+str(sigma)+'_plot')
metrics_path = os.path.join(proj_path, 'metrics/'+algo+'_sine'+str(sigma)+'_metrics.pkl')
bm_path = os.path.join(proj_path,'models/'+algo+'_sine_'+str(sigma))
bm_eval_path = os.path.join(proj_path, 'logs/'+algo+'_sine'+str(sigma)+'_eval')
df = sine_data_dict[sigma]
train_val_df = df.loc[(df.index >= train_start_date) & (df.index <= valid_end_date), :]
test_df = df.loc[(df.index >= valid_end_date), :]
bm = build_agent(train_val_df, window_size,
train_timesteps, algo,
bm_path=bm_path, bm_eval_path=bm_eval_path,
eval_freq=1000,
seed=1)
# metrics are saved individually per model, and together in a dataframe;
# profit curves are saved to a dataframe.
bm_metrics = evaluate_agent(test_df, bm, algo, sigma, window_size, seed, save=True, fig_savepath=fig_path, metrics_savepath=metrics_path)
metrics.append(bm_metrics)
# save metrics
metrics_df = pd.DataFrame(columns=['algorithm', 'sigma', 'cumulative_return', 'sharpe_ratio', 'daily_return', 'pct_in_market', 'max_drawdown'])
profit_df = pd.DataFrame()
for i, m in enumerate(metrics):
metrics_df.loc[i, :] = [m['algorithm'], m['sigma'], m['cumulative_return'], m['sharpe'],
m['daily_return'], m['pct_in_market'],
m['max_drawdown']]
profit_df[m['algorithm']+"_"+str(m['sigma'])] = m['profit_ts']
metrics_df.to_csv(results_table_path, index=False)
profit_df.to_csv(profit_curve_path, index=False)
print(f"Saved metrics table to {results_table_path}")
print(f"Saved profit curves to {profit_curve_path}")
Load in Trained Models & Score them on the Validation Period¶
Here we read back in the models trained above and score them on the validation period, re-evaluating the saved models and regenerating plots without retraining.
# to retrieve trained models
original_proj_path = '/content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/'
proj_path='/content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation'
window_size = 10
train_timesteps = 140000
algorithms=['ppo', 'a2c', 'dqn']
train_start_date = "2017-01-02"
valid_end_date = "2018-06-29"
train_val_df = df.loc[(df.index >= train_start_date) & (df.index <= valid_end_date), :]
valid_start_date = train_val_df.iloc[round(len(train_val_df)*0.66), :].Date
results_table_path = os.path.join(proj_path, 'metrics/tables/metrics_table.csv')
profit_curve_path = os.path.join(proj_path, 'metrics/tables/profit_ts.csv')
metrics = []
for algo in algorithms:
for i, sigma in enumerate(sigmas):
print(f"Getting {algo} model for sigma = {sigma}")
print("-"*150)
# directories for saving results
fig_path = os.path.join(proj_path, 'plots/'+algo+'_sine'+str(sigma)+'_plot')
metrics_path = os.path.join(proj_path, 'metrics/indiv_metrics/'+algo+'_sine'+str(sigma)+'_metrics.pkl')
bm_eval_path = os.path.join(proj_path, 'logs/'+algo+'_sine'+str(sigma)+'_eval')
bm_path = os.path.join(original_proj_path,'models/'+algo+'_sine_'+str(sigma))
df = sine_data_dict[sigma]
# evaluating models on the validation period
valid_df = df.loc[(df.index >= valid_start_date) & ((df.index <= valid_end_date)), :]
if algo.lower() == 'ppo':
model = PPO.load(os.path.join(bm_path, 'best_model.zip'))
elif algo.lower() == 'a2c':
model = A2C.load(os.path.join(bm_path, 'best_model.zip'))
elif algo.lower() == 'dqn':
model = DQN.load(os.path.join(bm_path, 'best_model.zip'))
# metrics are saved individually per model, and together in a dataframe;
# profit curves are saved in a dataframe
bm_metrics = evaluate_agent(valid_df, model, algo, sigma, window_size, seed, save=True, fig_savepath=fig_path, metrics_savepath=metrics_path)
metrics.append(bm_metrics)
# save metrics
metrics_df = pd.DataFrame(columns=['algorithm', 'sigma', 'cumulative_return', 'sharpe_ratio', 'daily_return', 'pct_in_market', 'max_drawdown'])
profit_df = pd.DataFrame()
for i, m in enumerate(metrics):
metrics_df.loc[i, :] = [m['algorithm'], m['sigma'], m['cumulative_return'], m['sharpe'],
m['daily_return'], m['pct_in_market'],
m['max_drawdown']]
profit_df[m['algorithm']+"_"+str(m['sigma'])] = m['profit_ts']
Getting ppo model for sigma = 0
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 63, <Actions.Buy: 1>: 59}
info: {'total_reward': 78.65792083740234, 'total_profit': 1.8467602467559816, 'position': <Positions.Long: 1>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/ppo_sine0_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/ppo_sine0_metrics.pkl
Getting ppo model for sigma = 4
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 57, <Actions.Buy: 1>: 65}
info: {'total_reward': 28.06151580810547, 'total_profit': 0.9136740394078292, 'position': <Positions.Long: 1>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/ppo_sine4_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/ppo_sine4_metrics.pkl
Getting ppo model for sigma = 8
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 57, <Actions.Buy: 1>: 65}
info: {'total_reward': -10.972724914550781, 'total_profit': 0.4865474160238129, 'position': <Positions.Long: 1>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/ppo_sine8_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/ppo_sine8_metrics.pkl
Getting ppo model for sigma = 12
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 61, <Actions.Buy: 1>: 61}
info: {'total_reward': -20.943382263183594, 'total_profit': 0.4280573411421025, 'position': <Positions.Long: 1>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/ppo_sine12_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/ppo_sine12_metrics.pkl
Getting ppo model for sigma = 16
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 69, <Actions.Buy: 1>: 53}
info: {'total_reward': 88.9033432006836, 'total_profit': 1.3227464360597074, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/ppo_sine16_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/ppo_sine16_metrics.pkl
Getting ppo model for sigma = 20
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 61, <Actions.Buy: 1>: 61}
info: {'total_reward': -23.744342803955078, 'total_profit': 0.34459168614813473, 'position': <Positions.Long: 1>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/ppo_sine20_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/ppo_sine20_metrics.pkl
Getting a2c model for sigma = 0
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 51, <Actions.Buy: 1>: 71}
info: {'total_reward': -12.418289184570312, 'total_profit': 0.6079805829809454, 'position': <Positions.Long: 1>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/a2c_sine0_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/a2c_sine0_metrics.pkl
Getting a2c model for sigma = 4
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 18, <Actions.Buy: 1>: 104}
info: {'total_reward': -40.002601623535156, 'total_profit': 0.5415310699109291, 'position': <Positions.Long: 1>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/a2c_sine4_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/a2c_sine4_metrics.pkl
Getting a2c model for sigma = 8
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 62, <Actions.Buy: 1>: 60}
info: {'total_reward': -19.67327117919922, 'total_profit': 0.4657123927240038, 'position': <Positions.Long: 1>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/a2c_sine8_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/a2c_sine8_metrics.pkl
Getting a2c model for sigma = 12
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 81, <Actions.Buy: 1>: 41}
info: {'total_reward': 162.63055419921875, 'total_profit': 2.273235675514655, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/a2c_sine12_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/a2c_sine12_metrics.pkl
Getting a2c model for sigma = 16
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 80, <Actions.Buy: 1>: 42}
info: {'total_reward': 134.2242546081543, 'total_profit': 2.401891438779098, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/a2c_sine16_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/a2c_sine16_metrics.pkl
Getting a2c model for sigma = 20
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 63, <Actions.Buy: 1>: 59}
info: {'total_reward': 241.61168670654297, 'total_profit': 2.5144640154526643, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/a2c_sine20_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/a2c_sine20_metrics.pkl
Getting dqn model for sigma = 0
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 12, <Actions.Buy: 1>: 110}
info: {'total_reward': -12.831001281738281, 'total_profit': 0.7488619185680739, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/dqn_sine0_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/dqn_sine0_metrics.pkl
Getting dqn model for sigma = 4
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 22, <Actions.Buy: 1>: 100}
info: {'total_reward': -20.741127014160156, 'total_profit': 0.6200279561216626, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/dqn_sine4_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/dqn_sine4_metrics.pkl
Getting dqn model for sigma = 8
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 31, <Actions.Buy: 1>: 91}
info: {'total_reward': 14.432968139648438, 'total_profit': 0.7946251650310626, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/dqn_sine8_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/dqn_sine8_metrics.pkl
Getting dqn model for sigma = 12
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 36, <Actions.Buy: 1>: 86}
info: {'total_reward': 101.09674072265625, 'total_profit': 1.9535468168921157, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/dqn_sine12_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/dqn_sine12_metrics.pkl
Getting dqn model for sigma = 16
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 52, <Actions.Buy: 1>: 70}
info: {'total_reward': 113.1202392578125, 'total_profit': 2.032953724304409, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/dqn_sine16_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/dqn_sine16_metrics.pkl
Getting dqn model for sigma = 20
------------------------------------------------------------------------------------------------------------------------------------------------------
(SMA, MACD, RSI) wern't all present. Reverting to default signal features (price and price-diff)
action_stats: {<Actions.Sell: 0>: 48, <Actions.Buy: 1>: 74}
info: {'total_reward': 194.22594451904297, 'total_profit': 5.79307107075949, 'position': <Positions.Short: 0>}
Plot of episode saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/plots/dqn_sine20_plot
Metrics saved to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/indiv_metrics/dqn_sine20_metrics.pkl
1.4.3 Mean Reversion¶
As a benchmark for comparison, we employ Mean Reversion (MR), a simple and commonly used single-stock trading strategy. MR is based on the prior belief that a stock price will revert to its mean after some time: it capitalises on the tendency of prices to return to their historical average, which can be particularly effective in markets exhibiting stable and predictable behaviour. The strategy requires a moving-average estimate and a deviation threshold that generates buy and sell signals.
The mean is typically calculated using a moving average of the stock's prices over a specific period. Commonly, traders might use simple moving averages (SMA). The strategy involves identifying significant deviations from this average. When a stock's price strays too far from its mean—either above or below—this is viewed as an anomaly likely to correct itself.
Buy Signal: A buy signal is triggered when the stock price drops significantly below the mean. This is seen as an indication that the stock is temporarily undervalued and expected to rise back to its average.
Sell Signal: Conversely, a sell signal is generated when the stock price rises significantly above the mean, suggesting that the stock is temporarily overvalued and likely to decrease.
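The buy/sell rules above can be sketched with a rolling SMA and a fractional deviation band (the `window` and `band` values are illustrative assumptions, and `mean_reversion_signals` is a hypothetical helper, distinct from the policy implemented later in this notebook):

```python
import numpy as np
import pandas as pd

def mean_reversion_signals(prices, window=30, band=0.05):
    """Flag buys when price drops `band` below its SMA,
    sells when it rises `band` above it."""
    sma = prices.rolling(window).mean()
    return pd.DataFrame({
        'sma': sma,
        'buy': prices < sma * (1 - band),   # significantly undervalued
        'sell': prices > sma * (1 + band),  # significantly overvalued
    })

# On a sinusoid around 100 with amplitude 10, both signals fire.
t = np.arange(200)
prices = pd.Series(100 + 10 * np.sin(0.2 * t))
signals = mean_reversion_signals(prices, window=31, band=0.05)
print(signals[['buy', 'sell']].sum())
```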
In the case of a periodic signal without long-term trend, correctly chosen thresholds will result in the optimal policy (Buy at the lowest, sell at the highest). This is demonstrated by using mean reversion on a noiseless sinusoidal price signal, as shown below. Given the periodic nature of the signals we are testing, and the zero-trend, mean reversion acts as an oracle strategy in the noiseless case and is powerful as long as the periodicity is not lost in the noise.
(Note: The images below will render when the repo is cloned and run locally. We encountered issues with image rendering on the GitHub website.) <img src ="images/noiseless_mr1.png" height="300" width="500" alt="noiseless_mr1">
Stock prices never resemble a noiseless sine wave, and as is expected, the performance of MR degrades as the signal approaches a white noise process. See the $\sigma = 20$ simulation as an example:
<img src ="images/noisy_mr1.png" height="300" width="500" alt="noisy_mr1">
1.4.4 Proof that the Mean Reversion policy is optimal for deterministic, periodic price signals¶
To prove that a mean reversion strategy is optimal if price movements follow a sine wave, we can model the stock price as
$$P_t = \alpha + \beta \times \sin(m \times t)$$Where the variables are as defined in Section 1.2. We see that the price reaches its extreme points like so:
- minimum at $P_t = \alpha - \beta$ when $\sin(m \times t) = -1$
- maximum at $P_t = \alpha + \beta$ when $\sin(m \times t) = 1$.
The Mean Reversion policy issues a buy signal when the price falls below the long-term moving average by a tuned threshold $-\gamma$, and a sell signal once the price exceeds the moving average by $\gamma$.
Given a periodic deterministic signal like the sine wave, we can tune $\gamma$ such that the policy will:
- Buy at $P_t = \alpha - \beta$, corresponding to $\sin(m \times t) = -1$
- Sell at $P_t = \alpha + \beta$, corresponding to $\sin(m \times t) = 1$
Buying at the minimum and selling at the maximum captures a swing of $2\beta$, the maximal exploitable price movement given the periodic nature of the sine function. So, when tuned, the MR policy is optimal for the periodic price signal.
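A quick numeric check of the argument (parameter values are illustrative): over one noiseless period, the largest exploitable swing, buying at the trough and selling at the peak, is exactly $2\beta$.

```python
import numpy as np

alpha, beta, m = 100.0, 10.0, 0.1
t = np.linspace(0, 2 * np.pi / m, 10_001)   # one full period
prices = alpha + beta * np.sin(m * t)
best_profit = prices.max() - prices.min()   # buy at trough, sell at peak
print(best_profit)  # 2*beta = 20
```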
MR policy¶
def mean_reversion_policy(state, prev_action, lower_th, upper_th):
"""
Determine action based on the mean reversion strategy.
Args:
- state: array of the form [price, 1st difference, moving-average mean]
- prev_action: action taken at the previous step (held when neither
  threshold is crossed)
- lower_th, upper_th: multiplicative thresholds on the mean which trigger
  the Buy and Sell actions respectively.
Returns:
- int: 1 for "buy", 0 for "sell"
"""
price = state[0]
mean = state[2]
if price < mean*lower_th:
action = 1 # Buy signal: price moves below mean by threshold
elif price > mean*upper_th:
action = 0 # Sell signal: price moves above mean by threshold
else:
action = prev_action
return action
Simulating stock trading with MR policy¶
validation=True
if validation:
proj_path='/content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation'
train_start_date = "2017-01-02"
valid_end_date = "2018-06-29"
results_table_path = os.path.join(proj_path, 'metrics/tables/val_metrics_table.csv')
profit_curve_path = os.path.join(proj_path, 'metrics/tables/val_profit_ts.csv')
val_metrics_path = os.path.join(proj_path, 'metrics/tables/val_metrics.csv')
window_size = 1
algo = 'mr'
metrics_lst = []
for sigma in sigmas:
action_stats = {Actions.Sell: 0, Actions.Buy: 0}
fig_path = os.path.join(proj_path, 'plots/'+algo+'_sine'+str(sigma)+'_plot')
metrics_path = os.path.join(proj_path, 'metrics/indiv_metrics/'+algo+'_sine'+str(sigma)+'_metrics.pkl')
bm_path = os.path.join(proj_path,'models/'+algo+'_sine_'+str(sigma))
bm_eval_path = os.path.join(proj_path, 'logs/'+algo+'_sine'+str(sigma)+'_eval')
df = sine_data_dict[sigma]
tmp = df.reset_index(drop=True)  # integer-indexed copy, needed for the date lookups below
# hide validation data to avoid leakage
if validation:
# +10 to align the validation period with the RL models
backtest_df = df.loc[(df.index >= train_start_date) & (df.index <= valid_start_date), :]
start_index = tmp.loc[tmp.Date == valid_start_date, :].index.values[0]+10
end_index = tmp.loc[tmp.Date == valid_end_date, :].index.values[0]+1
else:
backtest_df = df.loc[(df.index >= train_start_date) & (df.index <= valid_end_date), :]
start_index = tmp.loc[tmp.Date == '2018-07-13', :].index.values[0]
end_index = len(df)
# further work: threshold tuning
upper_th = backtest_df.Close.quantile(0.75)/backtest_df.Close.mean()
lower_th = backtest_df.Close.quantile(0.25)/backtest_df.Close.mean()
# create environment
window_size = 1
tmp = df.reset_index(drop=True)
env = MyStocksEnvBM(df=df,
window_size=window_size,
frame_bound=(start_index, end_index))
observation, info = env.reset(seed=seed)
# simulate episode
prev_action = 0
while True:
action = mean_reversion_policy(observation[0], prev_action, lower_th, upper_th)
action_stats[Actions(action)] += 1
observation, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
prev_action = action
if done:
break
env.close()
# get metrics and simulation plot
qs.extend_pandas()
net_worth = pd.Series(env.unwrapped.history['total_profit'], index=df.index[start_index+1:end_index])
returns = net_worth.pct_change().iloc[1:]
metrics = {}
metrics['algorithm'] = algo
metrics['sigma'] = sigma
metrics['profit_ts'] = net_worth
metrics['cumulative_return'] = net_worth.iloc[-1] - net_worth.iloc[0]
metrics['sharpe'] = qs.stats.sharpe(returns)
metrics['daily_return'] = qs.stats.expected_return(returns)
metrics['pct_in_market'] = action_stats[Actions.Buy] / (action_stats[Actions.Buy]+action_stats[Actions.Sell])
metrics['max_drawdown'] = qs.stats.max_drawdown(returns)
metrics_lst.append(metrics)
# save metrics and simulation plot
env.unwrapped.render_all_pretty()
plt.savefig(fig_path)
print(f"Plot of episode saved to {fig_path}")
with open(metrics_path, 'wb') as file:
pickle.dump(metrics, file)
print(f"Metrics saved to {metrics_path}")
baseline_metrics_df = pd.DataFrame(columns=['algorithm', 'sigma', 'cumulative_return', 'sharpe_ratio', 'daily_return', 'pct_in_market', 'max_drawdown'])
baseline_profit_df = pd.DataFrame()
for i, m in enumerate(metrics_lst):
baseline_metrics_df.loc[i, :] = [m['algorithm'], m['sigma'], m['cumulative_return'], m['sharpe'],
m['daily_return'], m['pct_in_market'],
m['max_drawdown']]
baseline_profit_df[m['algorithm']+"_"+str(m['sigma'])] = m['profit_ts']
assert baseline_profit_df.shape[0] == profit_df.shape[0]
Updating metrics tables to include the Mean Reversion metrics
# update the metrics table and profit_df to include the baseline
if metrics_df.loc[metrics_df.algorithm=='mr',:].shape[0] == 0:
metrics_df = pd.concat([metrics_df, baseline_metrics_df])
metrics_df.to_csv(results_table_path, index=True)
print(f"Saved updated metrics table to {results_table_path}")
if 'mr_0' not in profit_df.columns.tolist():
for col in baseline_profit_df.columns.tolist():
profit_df[col] = baseline_profit_df[col].values
profit_df.to_csv(profit_curve_path, index=True)
print(f"Saved updated profit curves to {profit_curve_path}")
Saved updated metrics table to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/tables/val_metrics_table.csv Saved updated profit curves to /content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation/metrics/tables/val_profit_ts.csv
1.5 Analysis¶
In this section, we compare performance between the RL models and the Mean Reversion benchmark using several metrics.
During model training, models were selected based on their performance on the validation dataset, which was unseen during training. We then evaluated the models on another out-of-time test set. We analyse model performance on both periods, drawing parallels between the two to better understand model behaviour, and to discuss model generalisation to unseen periods. This topic is discussed further, in the context of forex trading using RL, in Section 2.
This analysis is split into two parts. First, we focus on comparing the models in terms of cumulative returns, which is essentially the reward that the models are trained to maximise. We then supplement this with several additional metrics for a comprehensive analysis of model performance, comparing model performance on the validation and out-of-time (test) periods and discussing generalisation of the models to unseen time periods.
Metrics¶
We briefly introduce the metrics used in our analysis.
- Cumulative Return: % Profit, updated over time.
- Max Drawdown: A drawdown is a peak-to-trough decline during a period for an investment. Max drawdown records the worst drawdown as a %.
- % in Market: The % of time in the trading period where shares are held. This provides a relative measure of trading activity between the models.
- Sharpe Ratio: The Sharpe ratio is a reward-to-risk ratio. It measures how much excess return is received for the extra volatility endured by holding an asset. It is calculated as $$S = \frac{R_p - R_f}{\sigma_p}$$ where $R_p$ is the realised return of the portfolio, $R_f$ is the theoretical rate of return of a zero-risk investment, and $\sigma_p$ is the standard deviation of the portfolio's excess returns.
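As a rough illustration (a minimal sketch, not the exact quantstats implementation used below), these metrics can be computed directly from a net-worth series; the toy values here are purely illustrative:

```python
import numpy as np
import pandas as pd

# Toy net-worth curve (illustrative values only)
net_worth = pd.Series([1.00, 1.02, 0.99, 1.05, 1.04, 1.10])
returns = net_worth.pct_change().dropna()

# Cumulative return: end-of-period minus start-of-period net worth
cumulative_return = net_worth.iloc[-1] - net_worth.iloc[0]

# Sharpe ratio (risk-free rate assumed 0, annualised over 252 trading days)
sharpe = returns.mean() / returns.std() * np.sqrt(252)

# Max drawdown: worst peak-to-trough decline of the net-worth curve
drawdown = net_worth / net_worth.cummax() - 1
max_drawdown = drawdown.min()

print(cumulative_return, round(sharpe, 2), round(max_drawdown, 4))
```

quantstats applies the same ideas with additional conventions (annualisation, risk-free rate handling), so its numbers may differ slightly from this sketch.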
Plot functions¶
import warnings
warnings.simplefilter("ignore")
def profit_plots(metric_df, profits_df, ignore_mr=False, figsize=(10, 20), grid=True):
algos = list(algo for algo in metric_df.algorithm.unique() if algo != 'mr')
sigmas = list(sorted(metric_df.sigma.unique()))
fig, axs = plt.subplots(len(sigmas), figsize=figsize, sharex=True, sharey=True)
for sigma_idx, sigma in enumerate(sigmas):
ax = axs[sigma_idx]
ax.set_title(f"$\sigma$ = {sigma}")
if not ignore_mr:
profit_curve_mr = profits_df[['Date', f'mr_{sigma}']].copy()
profit_curve_mr['Date'] = pd.to_datetime(profit_curve_mr['Date'])
profit_curve_mr.set_index("Date", inplace=True)
profit_curve_mr.plot(ax=ax)
for algo_idx, algo in enumerate(algos):
profit_curve_algo = profits_df[['Date',f'{algo}_{sigma}']].copy()
profit_curve_algo['Date'] = pd.to_datetime(profit_curve_algo['Date'])
profit_curve_algo.set_index("Date", inplace=True)
profit_curve_algo.plot(ax=ax)
ax.grid(grid)
if ignore_mr:
ax.legend(algos)
else:
ax.legend(['mr'] + algos)
ax.set_xlabel("Time")
ax.set_ylabel("Cumulative Return")
fig.tight_layout()
plt.minorticks_off()
plt.show()
plt.close()
def profit_plots_separate(metric_df, profits_df, ignore_mr = False, figsize=(20, 20), grid=True):
algos = list(algo for algo in metric_df.algorithm.unique() if algo != 'mr')
sigmas = list(sorted(metric_df.sigma.unique()))
fig, axs = plt.subplots(len(sigmas), len(algos), figsize=figsize, sharex=True, sharey=True)
for algo_idx, algo in enumerate(algos):
for sigma_idx, sigma in enumerate(sigmas):
ax = axs[sigma_idx, algo_idx]
profit_curve_algo = profits_df[['Date',f'{algo}_{sigma}']].copy()
profit_curve_algo['Date'] = pd.to_datetime(profit_curve_algo['Date'])
profit_curve_algo.set_index("Date", inplace=True)
profit_curve_algo.plot(ax=ax)
profit_curve_mr = profits_df[['Date', f'mr_{sigma}']].copy()
profit_curve_mr['Date'] = pd.to_datetime(profit_curve_mr['Date'])
profit_curve_mr.set_index("Date", inplace=True)
profit_curve_mr.plot(ax=ax)
ax.grid(grid)
ax.set_title(f"{algo.upper()} , $\sigma$ = {sigma}")
ax.legend([algo, "mr"])
ax.set_xlabel("Time")
ax.set_ylabel("Cumulative Return")
fig.tight_layout()
plt.minorticks_off()
plt.show()
plt.close()
def metric_plots(metric_df, grid=True):
metric_names = [c for c in metric_df.columns if c != 'algorithm' and c != 'sigma']
algos = list(algo for algo in metric_df.algorithm.unique() if algo != 'mr')
fig, axs = plt.subplots(len(metric_names), figsize=(10, 20), sharex=True, sharey=False)
for metric_idx, metric_name in enumerate(metric_names):
ax = axs[metric_idx]
name = " ".join(metric_name.split("_")) # reformat metric name
ax.set_title(f"{name} by $\sigma$")
ax.set_xlabel("$\sigma$")
ax.set_ylabel(f"{metric_name}")
for algo_idx, algo in enumerate(algos):
algo_df = metric_df.query(f'algorithm == "{algo}"')[['sigma',metric_name]].set_index('sigma')
algo_df.plot(ax=ax)
ax.grid(grid)
ax.legend(algos, loc='upper right')
ax.set_xlabel("$\sigma$")
ax.set_xticks([0, 4, 8, 12, 16, 20])
ax.set_ylabel(name)
plt.minorticks_off()
plt.show()
plt.close()
def metric_plots_separate(metric_df, grid=True):
metric_names = [c for c in metric_df.columns if c != 'algorithm' and c != 'sigma']
algos = list(algo for algo in metric_df.algorithm.unique() if algo != 'mr')
fig, axs = plt.subplots(len(metric_names), len(algos), figsize=(20, 20), sharex=True, sharey=False)
for algo_idx, algo in enumerate(algos):
for metric_idx, metric_name in enumerate(metric_names):
ax = axs[metric_idx, algo_idx]
name = " ".join(metric_name.split("_")) # reformat metric name for plot
ax.set_title(f"{algo.upper()}, {name} by $\sigma$")
ax.legend([algo, "mr"])
ax.set_xlabel("$\sigma$")
ax.set_ylabel(f"{name}")
algo_df = metric_df.query(f'algorithm == "{algo}"')[['sigma',metric_name]].set_index('sigma')
algo_df.plot(ax=ax)
mr_df = metric_df.query(f'algorithm == "mr"')[['sigma',metric_name]].set_index('sigma')
mr_df.plot(ax = ax)
ax.grid(grid)
ax.set_xticks([0, 4, 8, 12, 16, 20])
ax.legend([algo, 'mr'], loc='upper right')
ax.set_xlabel("$\sigma$")
fig.tight_layout()
plt.minorticks_off()
plt.show()
plt.close()
def val_test_plots(grid=True):
metric_names = [c for c in oot_metrics.columns if c != 'algorithm' and c != 'sigma']
algos = list(algo for algo in oot_metrics.algorithm.unique() if algo != 'mr')
fig, axs = plt.subplots(len(metric_names), len(algos), figsize=(20, 20), sharex=True, sharey=False)
for algo_idx, algo in enumerate(algos):
for metric_idx, metric_name in enumerate(metric_names):
name = " ".join(metric_name.split("_"))
ax = axs[metric_idx, algo_idx]
ax.set_title(f"{algo.upper()}, {name} by $\sigma$")
ax.legend([algo, "mr"])
ax.set_xlabel("$\sigma$")
ax.set_ylabel(f"{metric_name}")
algo_df = oot_metrics.query(f'algorithm == "{algo}"')[['sigma',metric_name]].set_index('sigma')
algo_df.plot(ax=ax)
val_df = val_metrics.query(f'algorithm == "{algo}"')[['sigma',metric_name]].set_index('sigma')
val_df.plot(ax = ax)
ax.grid(grid)
ax.set_xticks([0, 4, 8, 12, 16, 20])
plt.minorticks_off()
ax.legend([algo.upper(), "val "+algo.upper()], loc='upper right')
plt.show()
plt.close()
# change directories as required
val_proj_path='/content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/validation'
oot_proj_path = '/content/drive/MyDrive/OneDrive_clone/Documents/stats_modules/ST455 (Reinforcement Learning)/final_project/'
oot_results_table_path = os.path.join(oot_proj_path, 'metrics/tables/metrics_table.csv')
oot_profit_curve_path = os.path.join(oot_proj_path, 'metrics/tables/profit_ts.csv')
val_results_table_path = os.path.join(val_proj_path, 'metrics/tables/val_metrics_table.csv')
val_profit_curve_path = os.path.join(val_proj_path, 'metrics/tables/val_profit_ts.csv')
# load in results
oot_metrics = pd.read_csv(oot_results_table_path).drop(columns=['daily_return'])
oot_profit = pd.read_csv(oot_profit_curve_path)
val_metrics = pd.read_csv(val_results_table_path).iloc[:, 1:].drop(columns=['daily_return'])
val_profit = pd.read_csv(val_profit_curve_path)
1.5.1 Cumulative Returns¶
From the plots below, we see that the Mean Reversion benchmark is more profitable than the RL models when $\sigma = 0$ (noiseless sinusoidal price movements). The PPO performance was also promising here. As the signal gets noisier, we see the RL models outperform the benchmark in several experiments. Notably, A2C outperforms MR when $\sigma = 12$ in both the validation and test periods, with significant upturns.
print("RL Algorithms vs Benchmark: - Cumulative returns on Validation Set")
profit_plots_separate(val_metrics, val_profit)
RL Algorithms vs Benchmark: - Cumulative returns on Validation Set
print("RL Algorithms vs Benchmark: - Cumulative returns on Test Set")
profit_plots_separate(oot_metrics, oot_profit)
RL Algorithms vs Benchmark: - Cumulative returns on Test Set
On the validation set, PPO shows superior performance on the easier price movements (low $\sigma$). For more volatile signals (high $\sigma$), both DQN and A2C overtake PPO in terms of cumulative returns. In the metrics section, we consider risk-conscious metrics such as the Sharpe ratio, and contrast them with the cumulative return results.
print("RL Algorithms comparison: Cumulative returns on Validation Set")
profit_plots(val_metrics, val_profit, ignore_mr=True, figsize=(8, 20))
RL Algorithms comparison: Cumulative returns on Validation Set
Model comparison on the test set is similar, but DQN performance is noticeably worse there, suggesting that the learnt policy doesn't generalise to unseen periods as well as A2C's.
print("RL Algorithms comparison: Cumulative returns on Test Set")
profit_plots(oot_metrics, oot_profit, ignore_mr=True, figsize=(8, 20))
RL Algorithms comparison: Cumulative returns on Test Set
1.5.2 Metrics¶
By plotting the metrics for each experiment, we can now clearly see the trend of A2C and DQN beating the benchmark in terms of returns when the price is more volatile. We observe the same crossover in Sharpe ratio, suggesting that even after taking into account the risk posed by the increased volatility, these models made better trades in the high $\sigma$ experiments. The % in market metric shows that these models also spend comparatively fewer time points in-market when the price is more volatile.
However, their drawdowns are consistently more severe than the benchmark's, which indicates instability. More on this in the conclusion.
print("RL Algorithms vs Benchmark: - Metrics on Validation Set by \u03C3")
metric_plots_separate(val_metrics)
RL Algorithms vs Benchmark: - Metrics on Validation Set by σ
Notably, we see consistent improvements in Sharpe ratio for A2C and DQN as the price signal becomes more volatile, in contrast with PPO, which worsens. The drawdowns worsen across all models as volatility increases.
print("RL Algorithms comparison: Metrics on Validation Set by \u03C3")
metric_plots(val_metrics)
RL Algorithms comparison: Metrics on Validation Set by σ
On the test set, the performance of A2C and DQN is relatively worse than in validation, which suggests that the generalisation error of these models is substantial. In the conclusion, we discuss several steps that may alleviate this problem.
print("RL Algorithms vs Benchmark: - Metrics on Test Set by \u03C3")
metric_plots_separate(oot_metrics)
RL Algorithms vs Benchmark: - Metrics on Test Set by σ
In contrast to the validation results, DQN is now outperformed by PPO.
print("RL Algorithms comparison: Metrics on Test Set by \u03C3")
metric_plots(oot_metrics)
RL Algorithms comparison: Metrics on Test Set by σ
Here we observe the generalisation error for the RL models directly, by comparing performance on the validation and out-of-time test periods. The disparities between performance on the two periods give insight into model stability and generalisation ability. Note that the validation period was used to select the best model, so positive bias is expected.
Generally, model performance in the validation and test sets are correlated, with the DQN showing the strongest disparities. This is best seen in the cumulative return plot in the top right, where the generalization error grows sharply as $\sigma$ increases.
val_test_plots()
1.5.3 Analysis Results¶
From the plots above, we observe distinct differences between model performances in every experiment. We also explored the interesting changes in model behaviour as we increased the noise parameter.
At low $\sigma$ values, the mean reversion benchmark ("mr") outperforms all of the algorithms. This is expected, as at low noise levels MR should be optimal or near optimal, since it deterministically buys at low prices and sells at high prices (see proof in Section 1.4.4). As noise increases, this advantage weakens, with the benchmark underperforming DQN and A2C. In the out-of-time test, A2C still beats the benchmark in high $\sigma$ simulations. At low $\sigma$, PPO outperforms the competing RL models, behaving similarly to mean reversion, but gradually loses its performance advantage as the level of noise increases, underperforming the other algorithms at high noise; this effect is more pronounced in validation than in the out-of-time period. Cumulative returns correlate with the noisiness of the price signal: in the low-to-mid noise experiments ($\sigma = 0$ to $8$), the cumulative return exhibits low variance.
In the validation period, as noise increases, the performance of both DQN and A2C improves, beating Mean Reversion and PPO. We observe the same performance crossover in Sharpe ratio, suggesting that even after taking into account the risk posed by the increased volatility, DQN and A2C made better trades in the high $\sigma$ experiments. The % in market metric shows that these models also spend comparatively fewer time points in-market when the price is more volatile. But in the test set, DQN becomes unprofitable at high $\sigma$ values, with A2C maintaining profitability for $\sigma = 12$ and $\sigma = 20$.
It was demonstrated in the OpenRL benchmark that A2C models are generally harder to train than other RL algorithms such as PPO (Huang et al., 2024). But when A2C was successfully trained, we see promising results, even for the noisy signal (see A2C for $\sigma=12$). In fact, we observe better A2C performance on noisy signals, where the price has a higher fluctuation frequency. This was the case on both the validation set and the out-of-time test set. But the cumulative returns were also quite volatile, as we observe in the sharp profit swings in the high $\sigma$ experiments. The impact of price movement frequency on A2C performance was not explored in depth here, and could be an interesting area for further work.
Generalization of RL models to unseen periods¶
We then examine the disparity between validation and test performance. As expected, the models performed better in the validation period than in the test period. The difference was most dramatic for DQN, as shown in the third column of plots above. This indicates that the learnt policy is sensitive to exact states. This is not desirable behaviour in RL trading agents, so steps must be taken to avoid it in practice.
Better performance may be achieved through hyperparameter tuning, such as the clipping parameter $\epsilon$ in PPO, or the size of the replay buffer in DQN. These may be optimised via grid search or Bayesian optimisation. We may also try different neural network architectures in the case of DQN. In the second half of this project, we explore several of these ideas using real market data.
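A minimal sketch of how such a grid search could be wired up (the parameter values and the `train_and_eval` callable are hypothetical placeholders, not the settings used in this project):

```python
from itertools import product

# Hypothetical hyperparameter grid (illustrative values only)
param_grid = {
    "learning_rate": [1e-4, 5e-4, 1e-3],
    "buffer_size": [10_000, 50_000],
}

def grid_search(train_and_eval, grid):
    """Exhaustively evaluate every combination in `grid`.

    `train_and_eval` maps a params dict to a scalar validation score
    (e.g. cumulative return over the validation period).
    """
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = train_and_eval(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Dummy scorer standing in for "train a DQN and evaluate on validation"
best, score = grid_search(
    lambda p: -abs(p["learning_rate"] - 5e-4) - p["buffer_size"] * 1e-9,
    param_grid,
)
print(best)
```

In practice each `train_and_eval` call would train a fresh SB3 model and simulate it on the validation environment, so the grid should be kept small or replaced by Bayesian optimisation.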
2. Application of DQN, PPO and A2C to Forex trading ($\text{EUR} \Leftrightarrow \text{USD}$)¶
Forex trading involves the simultaneous buying and selling of world currencies on a decentralized global market.
*EUR_USD* is the most liquid currency pair, which means it has the highest trading volume in the forex markets. (João Carapuço et al., 2018)
This high liquidity results in tighter spreads and lower transaction costs, making it attractive for frequent trading, such as scalping or high-frequency trading strategies. That is why reinforcement learning algorithms provide an opportunity to find strategies that are profitable for traders choosing this pair.
By modeling the EUR/USD pair, traders can capitalize on its high liquidity, sensitivity to major economic events, and the robustness of technical patterns.
Rationale for the choice of the year
The year 2017 was pivotal for EUR/USD traders due to political and economic events in Europe and the United States that influenced forex markets (both for USD and EUR).
In Europe, the political landscape stabilized following Emmanuel Macron's victory in the French presidential elections, boosting investor confidence in the euro. The Eurozone experienced a robust economic recovery, outperforming expectations and supporting a favorable view of the euro.
Conversely, the U.S. faced political uncertainties with the onset of the Trump presidency. This uncertainty often led to a weaker dollar.
Also, monetary policy paths diverged significantly. The European Central Bank continued its quantitative easing program, while the Federal Reserve began tightening monetary policy.
This played a crucial role in the fluctuations and heightened trading activity of the EUR/USD currency pair throughout 2017. That is why, when considering this particular trading pair, it is of interest to investigate the trading activity of 2017, focusing on a month with historically different market behaviour.
from google.colab import drive
drive.mount('/content/drive')
dataset = pd.read_csv('/content/drive/MyDrive/pyfinance.csv')
dataset['date'] = pd.to_datetime(dataset['date'])
desired_month = 3 # For March - final month in a financial year
desired_year = 2017
df = dataset[(dataset['date'].dt.month == desired_month) & (dataset['date'].dt.year == desired_year)]
df.set_index('date', inplace=True)
df
window_size = 180 # Example: looking back 180 minutes (3 hours)
start_date = "2017-03-01"
end_date = "2017-03-31"
filtered_data_dqn = df[start_date:end_date]
start_index = window_size
end_index = len(filtered_data_dqn)
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
#Seed
import random
import torch
seed = 42
np.random.seed(seed)
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
2.1. DQN architecture impact in investment/trading strategy¶
The objective of this part is to investigate the impact of varying DQN architectures on the distribution of trading positions (Long, Short, Flat) and actions (Buy, Hold, Sell, Double Buy, Double Sell) over time. We also aim to explore how these distributions evolve as the model experiences more market scenarios, with the hypothesis that an increase in the number of training timesteps will lead to a more diversified action distribution. This hypothesis rests on the premise that extended exposure to a wide array of market conditions allows the DQN to learn a more nuanced strategy, thereby reducing the likelihood of overfitting to specific market trends and increasing the model's adaptability. (Chen & Gao, 2019) researched the impact of the number of layers on model performance. This study closes the gap by examining how the distributions of actions and positions respond to enhancements (additional layers) in the DQN architecture.
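One way to realise such architecture variations in stable-baselines3 is via the `net_arch` entry of `policy_kwargs` (a real SB3 parameter giving the hidden-layer sizes of `MlpPolicy`); the variant names and layer sizes below are illustrative assumptions, not the exact architectures compared in this section:

```python
# Illustrative DQN architecture variants, expressed as hidden-layer sizes.
# The specific names and sizes here are assumptions for the sketch.
architectures = {
    "shallow": [64],
    "default": [64, 64],
    "deep": [256, 128, 64],
}

def make_policy_kwargs(name):
    """Build the policy_kwargs dict selecting one named architecture."""
    return {"net_arch": architectures[name]}

# Usage (assuming vec_env and seed from the surrounding notebook):
# model = DQN("MlpPolicy", vec_env, policy_kwargs=make_policy_kwargs("deep"), seed=seed)
print(make_policy_kwargs("deep"))
```

Training one model per entry of `architectures` and comparing the resulting action/position distributions is the experiment this section describes.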
2.1.1. Custom Environment (upgrading gym_anytrading with new positions and actions)¶
Action Space: a discrete action space with 5 possible actions (DOUBLE_SELL, SELL, HOLD, BUY, DOUBLE_BUY) - more complex compared to the gym_anytrading environment, which typically only has Buy and Sell actions.
State/Position Representation: SHORT, FLAT, and LONG, whereas the gym_anytrading environment doesn't include a FLAT position. In the previous environment, if the previous position was SHORT (potentially selling the asset) but the trader took no action in the next period, the position would still be SHORT; it changes only when an action happens. That is why adding FLAT is of interest. (Abdillah Baradja et al., 2023)
Transformation Function: the transform function updates the position based on the current position and the action taken, with additional logic to handle DOUBLE_SELL and DOUBLE_BUY.
Reward Calculation: considers both the direction of the position and a trade fee percentage.
Observation Space: includes several financial indicators (prices, highs, lows, volumes) along with the current position and the normalized time since the last trade (the gym_anytrading environment only covers the closing price and price differences).
Maximum Possible Profit: assumes a perfect foresight strategy (buying at the lowest price and selling at the highest).
Double Buy:
- Returning Borrowed Shares: if the agent has any shares borrowed from a securities firm (a short selling), they are returned
- Purchasing New Shares: the agent uses all available cash to purchase new shares at the current market price
This action is aggressive, as it not only covers any existing short positions but also commits all cash reserves to new stock purchases, increasing risk if the market moves unfavorably.
Double Sell:
- Selling Owned Stocks: sells all the stocks they currently own, converting all held equities into cash (basic sell action)
- Selling Borrowed Stocks: sells stocks that have been borrowed from a securities company with the obligation to repurchase them later. The goal here is to benefit from a decline in stock prices. This is aggressive and risky, because the trader must be very confident about future market fluctuations.
class Actions(Enum):
DOUBLE_SELL = 0
SELL = 1
HOLD = 2
BUY = 3
DOUBLE_BUY = 4
class Positions(Enum):
SHORT = -1
FLAT = 0
LONG = 1
def transform(position: Positions, action: int) -> tuple[Positions, bool]:
if action == Actions.SELL.value:
return (Positions.SHORT if position == Positions.FLAT else Positions.FLAT, True)
elif action == Actions.BUY.value:
return (Positions.LONG if position == Positions.FLAT else Positions.FLAT, True)
elif action == Actions.DOUBLE_SELL.value:
return (Positions.SHORT, True)
elif action == Actions.DOUBLE_BUY.value:
return (Positions.LONG, True)
return (position, False)
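To make the transition logic concrete, the snippet below enumerates the full position-transition table implied by transform (it repeats the definitions above verbatim so it can run standalone):

```python
from enum import Enum

# Repeated from the environment definition above, so this snippet is standalone
class Actions(Enum):
    DOUBLE_SELL = 0
    SELL = 1
    HOLD = 2
    BUY = 3
    DOUBLE_BUY = 4

class Positions(Enum):
    SHORT = -1
    FLAT = 0
    LONG = 1

def transform(position, action):
    if action == Actions.SELL.value:
        return (Positions.SHORT if position == Positions.FLAT else Positions.FLAT, True)
    elif action == Actions.BUY.value:
        return (Positions.LONG if position == Positions.FLAT else Positions.FLAT, True)
    elif action == Actions.DOUBLE_SELL.value:
        return (Positions.SHORT, True)
    elif action == Actions.DOUBLE_BUY.value:
        return (Positions.LONG, True)
    return (position, False)

# Enumerate the full table: (position, action) -> (new position, trade made?)
table = {
    (pos.name, act.name): (transform(pos, act.value)[0].name, transform(pos, act.value)[1])
    for pos in Positions for act in Actions
}
# e.g. SELL from FLAT opens a short; SELL from LONG closes back to FLAT;
# HOLD never changes the position or counts as a trade
print(table[("FLAT", "SELL")], table[("LONG", "SELL")], table[("SHORT", "HOLD")])
```

Note that SELL and BUY from a non-FLAT position close the position back to FLAT, while the DOUBLE_* actions always force the corresponding directional position regardless of the starting state.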
class CustomEnv(gym.Env):
metadata = {'render.modes': ['human']}
def __init__(self, df, window_size, frame_bound, seed = None):
super().__init__()
self.seed(seed)
self.df = df
self.window_size = window_size
self.frame_bound = frame_bound
self.trade_fee_bid_percent = 0.0001
self.trade_fee_ask_percent = 0.0001
self.action_space = spaces.Discrete(len(Actions))
self._action_history = []
self._position_history = []
self.prices, self.signal_features = self._process_data()
self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(self._get_observation_shape(),), dtype=np.float32)
self.reset()
def seed(self, seed=None):
self.np_random, seed = gym.utils.seeding.np_random(seed)
return [seed]
def _process_data(self):
prices = self.df['Close'][self.frame_bound[0]-self.window_size:self.frame_bound[1]].values
highs = self.df['High'][self.frame_bound[0]-self.window_size:self.frame_bound[1]].values
lows = self.df['Low'][self.frame_bound[0]-self.window_size:self.frame_bound[1]].values
volumes = self.df['Volume'][self.frame_bound[0]-self.window_size:self.frame_bound[1]].values
diffs = np.insert(np.diff(prices), 0, 0) #
avg_prices = (highs + lows + prices) / 3
signal_features = np.column_stack((prices, diffs, highs, lows, avg_prices, volumes))
return prices, signal_features
def _calculate_reward(self, action):
current_price = self.prices[self.current_tick]
last_trade_price = self.prices[self.last_trade_tick]
price_diff = current_price - last_trade_price
reward = 0.0
if self.position == Positions.LONG:
reward = (price_diff - self.trade_fee_ask_percent) * 10000
elif self.position == Positions.SHORT:
reward = (-price_diff - self.trade_fee_bid_percent) * 10000
return reward
def _get_observation(self):
base_obs = self.signal_features[self.current_tick - self.window_size + 1 : self.current_tick + 1].flatten()
additional_data = np.array([self.position.value, (self.current_tick - self.last_trade_tick) / self.eps_length])
full_obs = np.concatenate([base_obs, additional_data])
return full_obs
def _get_observation_shape(self):
num_features_per_tick = self.signal_features.shape[1]
total_features = num_features_per_tick * self.window_size + 2 # +2 for additional data
return total_features
def step(self, action):
self.position, trade = transform(self.position, Actions(action).value)
reward = self._calculate_reward(action) if trade else 0
self.total_reward += reward
correct_index = self.current_tick + self.frame_bound[0] - self.window_size
self._action_history.append((action, correct_index))
self._position_history.append((self.position.value, correct_index))
# record the tick of the last executed trade so that the reward and the
# "time since last trade" observation are measured from the latest trade
if trade:
self.last_trade_tick = self.current_tick
self.current_tick += 1
done = self.current_tick >= self.frame_bound[1]
obs = self._get_observation()
info = {"trade_made": trade}
return obs, reward, done, info
def reset(self):
self.current_tick = self.window_size
self.last_trade_tick = self.current_tick - 1
self.position = Positions.FLAT
self.total_reward = 0
self.prices, self.signal_features = self._process_data()
self.eps_length = self.frame_bound[1] - self.frame_bound[0]
self._action_history.clear()
self._position_history.clear()
return self._get_observation()
def max_possible_profit(self):
# Buy at the lowest, sell at the highest
min_price = np.min(self.prices)
max_price = np.max(self.prices)
if min_price == max_price:
return 0
return (max_price - min_price) / min_price
def render_positions(self, mode='human', close=False):
if close:
plt.close()
return
if mode == 'human':
plt.figure(figsize=(15, 7))
plt.plot(self.prices, label='Price', color='blue')
short_indices = [index for position, index in self._position_history if position == Positions.SHORT.value]
flat_indices = [index for position, index in self._position_history if position == Positions.FLAT.value]
long_indices = [index for position, index in self._position_history if position == Positions.LONG.value]
if short_indices:
plt.scatter(short_indices, [self.prices[i] for i in short_indices], color='red', label='Short', marker='v', alpha=0.5)
if flat_indices:
plt.scatter(flat_indices, [self.prices[i] for i in flat_indices], color='gray', label='Flat', marker='o', alpha=0.5)
if long_indices:
plt.scatter(long_indices, [self.prices[i] for i in long_indices], color='green', label='Long', marker='^', alpha=0.5)
plt.title('Position Changes Over Price')
plt.xlabel('Ticks')
plt.ylabel('Price')
plt.legend()
plt.show()
def render_action_distribution(self):
action_counts = Counter([action for action, _ in self._action_history])
actions = list(action_counts.keys())
counts = list(action_counts.values())
total = sum(counts)
percentages = [count / total * 100 for count in counts]
labels = [Actions(action).name for action in actions]
fig, ax = plt.subplots()
ax.pie(percentages, labels=labels, autopct='%1.1f%%', startangle=90)
ax.axis('equal') # Equal aspect ratio ensures the pie is circular.
ax.set_title('Action Distribution')
plt.show()
def render_position_distribution(self):
position_counts = Counter([position for position, _ in self._position_history])
positions = list(position_counts.keys())
counts = list(position_counts.values())
total = sum(counts)
percentages = [count / total * 100 for count in counts]
labels = [Positions(position).name for position in positions]
fig, ax = plt.subplots()
ax.pie(percentages, labels=labels, autopct='%1.1f%%', startangle=90)
ax.axis('equal') # Equal aspect ratio ensures the pie chart is circular.
ax.set_title('Position Distribution')
plt.show()
def render_distributions(self):
position_counts = dict(Counter([pos for pos, idx in self._position_history]))
positions = list(position_counts.keys())
action_counts = dict(Counter([act for act, idx in self._action_history]))
actions = list(action_counts.keys())
fig, axs = plt.subplots(1, 2, figsize=(15, 7))
labels_pos = [Positions(position).name for position in positions]
labels_act = [Actions(action).name for action in actions]
axs[0].pie(position_counts.values(), labels=labels_pos, autopct='%1.1f%%')
axs[0].set_title('Positions Distribution')
axs[1].pie(action_counts.values(), labels=labels_act, autopct='%1.1f%%')
axs[1].set_title('Actions Distribution')
plt.show()
2.1.2 Model Training¶
env = CustomEnv(df=filtered_data_dqn, window_size=window_size, frame_bound=(start_index, end_index), seed = seed)
env.reset()
vec_env = DummyVecEnv([lambda: env]) # Wrap the environment
vec_env.seed(seed)
/usr/local/lib/python3.10/dist-packages/stable_baselines3/common/vec_env/patch_gym.py:49: UserWarning: You provided an OpenAI Gym environment. We strongly recommend transitioning to Gymnasium environments. Stable-Baselines3 is automatically wrapping your environments in a compatibility layer, which could potentially cause issues. warnings.warn(
[42]
from stable_baselines3 import DQN
from stable_baselines3.common.vec_env import DummyVecEnv
model = DQN(
policy="MlpPolicy",
env=vec_env,
seed = seed,
learning_rate=0.0005,
buffer_size=50000,
learning_starts=1000,
batch_size=32,
tau=1.0,
gamma=0.99,
train_freq=4,
gradient_steps=1,
exploration_fraction=0.1,
exploration_initial_eps=1.0,
exploration_final_eps=0.1,
max_grad_norm=10,
verbose=1,
tensorboard_log="./dqn_forex_tensorboard/",
)
total_timesteps = len(filtered_data_dqn)-2*window_size
model.learn(total_timesteps=total_timesteps)
model.save("dqn_forex_trading_model")
Using cpu device Logging to ./dqn_forex_tensorboard/DQN_1
env.render_positions()
This demonstrates that, closer to the second half, the agent becomes more cautious with selling: it is more frequently in a long position, probably waiting for a favourable market state. This behaviour suggests that the model learns to fit the market behaviour and propose a better strategy depending on the conditions.
env.render_distributions()
2.1.3 Comparison DQN Architectures for timestep = 10000¶
Bias of the agent due to an insufficient training period
- With too few training timesteps, changes in architecture may not be significant: the agent does not learn the underlying patterns properly and instead fits a suboptimal strategy in most cases. Having seen only limited data, it starts drawing conclusions without experience.
Positions Distribution¶
- SHORT around 90% for most architectures: the agent spends most of its time in the SHORT position. This could mean the agent has learned that holding SHORT is more often the profitable choice given the dataset and the simulated market conditions.
- LONG and FLAT: the agent seldom holds a LONG position or stays FLAT. The limited use of these positions suggests it finds few profitable opportunities to go LONG or to stay out of the market.
This apparent profitability is reached via the DOUBLE_SELL action: the agent aggressively sells assets it does not own, i.e. borrows them, which creates an obligation to buy them back later and therefore risk exposure. Such an action is only genuinely favourable if the trader can anticipate future price moves. In conclusion, at 10,000 training timesteps the learned policies are generally not profitable: they exploit the current market state to book quick gains via DOUBLE_SELL, but the actual profit can only be measured once these liabilities are closed. This is why training for more steps is important.
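As a toy illustration (made-up prices) of why a short position's profit is only realised once the borrowed units are bought back:

```python
# Toy short-sale P&L: sell borrowed units at the entry price,
# buy them back at the exit price; profit requires the price to fall.
entry_price = 1.10   # price at which the borrowed asset is sold
exit_price = 1.05    # price at which it is repurchased
units = 1000

pnl = (entry_price - exit_price) * units
print(round(pnl, 2))  # 50.0 here; negative if the price had risen instead
```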
Actions Distribution¶
DOUBLE_SELL: The agent overwhelmingly favors the DOUBLE_SELL action. This action corresponds to entering or maintaining a SHORT position. This could indicate a strong bias in the agent's strategy toward expecting a declining market.
SELL, DOUBLE_BUY, HOLD, BUY: These actions are taken relatively infrequently, indicating that the agent has a strong preference for the DOUBLE_SELL action over others.
The DOUBLE_SELL action always moves the agent to SHORT regardless of the current position, while SELL moves to SHORT if the position was FLAT and to FLAT otherwise. This is a modification of the original gym_anytrading environment that allows short-selling.
The render plot shows that most BUY actions occur at the beginning, during a downward price trend; the agent may then conclude that the trend will repeat or even worsen and goes SHORT via DOUBLE_SELL (the agent is not exploring enough).
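The transition rule described above can be sketched as follows (illustrative stand-in enums, not the real gym_anytrading classes; the BUY/DOUBLE_BUY branches are assumed to mirror SELL/DOUBLE_SELL):

```python
from enum import Enum

# Illustrative stand-ins for the environment's Positions/Actions enums.
class Pos(Enum):
    SHORT = 0
    FLAT = 1
    LONG = 2

class Act(Enum):
    DOUBLE_SELL = 0
    SELL = 1
    HOLD = 2
    BUY = 3
    DOUBLE_BUY = 4

def transition(position, action):
    """Position-transition rule described in the text; BUY/DOUBLE_BUY
    are assumed to mirror SELL/DOUBLE_SELL."""
    if action == Act.DOUBLE_SELL:
        return Pos.SHORT                                  # always jumps to SHORT
    if action == Act.SELL:
        # from FLAT a SELL opens a SHORT; otherwise it unwinds to FLAT
        return Pos.SHORT if position == Pos.FLAT else Pos.FLAT
    if action == Act.DOUBLE_BUY:
        return Pos.LONG                                   # mirror: always to LONG
    if action == Act.BUY:
        return Pos.LONG if position == Pos.FLAT else Pos.FLAT
    return position                                       # HOLD keeps the position
```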
save_dir = "/content/drive/My Drive/Colab Notebooks/saved_models_RL"
if not os.path.exists(save_dir):
os.makedirs(save_dir)
def create_dqn_model(env, net_arch):
    # Use the env passed in, not the global vec_env
    return DQN("MlpPolicy", env, policy_kwargs=dict(net_arch=net_arch), verbose=1, seed=seed)
architectures = [
[32, 64, 64, 32], # Four layers with increasing then decreasing units
[128, 256, 128], # A bottleneck architecture
[64, 128, 128, 256], # A deep network with increasing units
    [400, 300], # Wide two-layer network (layer sizes popularised by DeepMind's DDPG)
[32, 32]
]
for i, arch in enumerate(architectures):
env = CustomEnv(df=filtered_data_dqn, window_size=window_size, frame_bound=(start_index, end_index), seed = seed)
env.reset()
vec_env = DummyVecEnv([lambda: env])
vec_env.seed(seed)
model_dqn = create_dqn_model(vec_env, arch)
model_dqn.learn(total_timesteps=10000)
model_path = os.path.join(save_dir, f"dqn_model_arch_{i}")
model_dqn.save(model_path)
print(f"Architecture {arch}")
env.render_distributions()
Using cpu device Architecture [32, 64, 64, 32]
Using cpu device Architecture [128, 256, 128]
Using cpu device Architecture [64, 128, 128, 256]
Using cpu device Architecture [400, 300]
Using cpu device Architecture [32, 32]
env.render_positions()
2.1.4. Comparison DQN Architectures for full timesteps¶
The choice of DQN architecture can be somewhat analogous to different investment styles in financial markets.
Smaller architectures ([32, 32]) resemble a more conservative or diversified investment strategy. They avoid large bets on specific market movements, which could be akin to a balanced mutual fund that spreads risk across various assets.
Moderate architectures ([32, 64, 64, 32]) are comparable to a more active investment approach, trying to capitalize on market inefficiencies without taking extreme positions: a growth-oriented fund that looks for value but stays diversified.
Larger, more complex architectures ([128, 256, 128], [64, 128, 128, 256], [400, 300]) reflect a more aggressive or specialized investment style to maximize returns based on specific market predictions. These networks have the capacity to capture intricate patterns in the data, potentially leading to higher risk-reward trades.
save_dir = "/content/drive/My Drive/Colab Notebooks/saved_models_RL2"
if not os.path.exists(save_dir):
os.makedirs(save_dir)
def create_dqn_model(env, net_arch):
    # Use the env passed in, not the global vec_env; seed for reproducibility
    return DQN("MlpPolicy", env, policy_kwargs=dict(net_arch=net_arch), verbose=1, seed=seed)
architectures = [
[32, 64, 64, 32], # Four layers with increasing then decreasing units
[128, 256, 128], # A bottleneck architecture
[64, 128, 128, 256], # A deep network with increasing units
    [400, 300], # Wide two-layer network (layer sizes popularised by DeepMind's DDPG)
[32, 32]
]
for i, arch in enumerate(architectures):
env = CustomEnv(df=filtered_data_dqn, window_size=window_size, frame_bound=(start_index, end_index), seed = seed)
env.reset()
vec_env = DummyVecEnv([lambda: env])
vec_env.seed(seed)
    model_dqn = create_dqn_model(vec_env, arch)
    model_dqn.learn(total_timesteps=len(filtered_data_dqn)-2*window_size)
    model_path = os.path.join(save_dir, f"dqn_model_arch_{i}")
    model_dqn.save(model_path)
print(f"Architecture {arch}")
env.render_distributions()
Using cpu device Architecture [32, 64, 64, 32]
Using cpu device Architecture [128, 256, 128]
Using cpu device Architecture [64, 128, 128, 256]
Using cpu device Architecture [400, 300]
Using cpu device Architecture [32, 32]
DQN Results¶
[32, 64, 64, 32]: Agent predominantly takes SHORT positions but also explores LONG and FLAT to some extent. The action distribution shows a balance between BUY and DOUBLE_SELL, indicating a policy that is neither overly conservative nor excessively risk-taking.
[128, 256, 128]: LONG position is significantly dominant, and the action distribution leans heavily towards BUY, suggesting the agent learned a strong bias towards bullish market conditions or that the data has a pattern that frequently rewards the LONG position.
[64, 128, 128, 256]: This architecture presents a more balanced position distribution. The action distribution is still BUY-focused but includes a non-negligible proportion of DOUBLE_SELL, showing some level of caution or risk management.
[400, 300]: The agent with this architecture holds LONG positions most of the time, indicating a strong bullish bias in its policy. The actions are dominated by BUY and DOUBLE_SELL, suggesting a possible alternating strategy between holding assets and taking advantage of expected price declines.
[32, 32]: The positions and actions are more evenly distributed here compared to other architectures. The DOUBLE_SELL action is prominent, but there's a significant HOLD portion, indicating a strategy that may be more responsive to market changes rather than committing to one trend.
2.2 PPO and A2C applied to Forex Trading (EUR_USD) with gym_anytrading environment¶
Goal: *The model will be trained on data from one month, selected for its characteristic trading behavior, and then tested on another month known for significantly different trading activity. The primary objective is to assess the robustness and reliability of the model under varying market conditions, with the aim of ensuring that the model remains effective and minimizes potential losses even when market dynamics change.*
In forex trading, the behavior of currency pairs can vary widely throughout the year due to various economic, political, and seasonal factors.
January and August:
- January: The beginning of the year can see increased volatility and significant movements as traders and institutions adjust their portfolios for the new fiscal year.
- August: This month often experiences lower liquidity and higher volatility due to the summer holidays in many Western countries (reduced activity).
March and December:
- March: Often a period of heightened activity and volatility, partly due to the close of the fiscal year for many companies (for some it's January, for others March, which is why both months appear here) and portfolio adjustments. Economic forecasts and central bank policy shifts announced in early spring can also influence trading.
- December: Early in the month, there might be robust activity as traders adjust positions and close books for the year. However, liquidity often drops significantly closer to the end of the month due to the holiday season, leading to potential spikes in volatility. (Tsai et al., 2020)
That is why, when using an RL algorithm to identify a better trading or hedging strategy, it is important to validate whether the model captures unexpected trends and patterns (Briola et al., 2021).
This can be done by applying the model to market data for the same trading pair from another month. In this study we focus on comparing March and December.
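A cross-month check like this reduces to rolling the trained policy through an environment built from the other month's data. A hypothetical helper (it assumes a trained SB3 model and a gym-style env using the old 4-tuple `step` API, as the gym_anytrading environments here do):

```python
def evaluate_out_of_sample(model, env, deterministic=True):
    """Roll a trained policy through an unseen environment, summing rewards."""
    obs = env.reset()
    total_reward, done, info = 0.0, False, {}
    while not done:
        # SB3 models expose predict(obs) -> (action, hidden_states)
        action, _states = model.predict(obs, deterministic=deterministic)
        obs, reward, done, info = env.step(action)
        total_reward += float(reward)
    return total_reward, info
```

For example, something like `evaluate_out_of_sample(model_a2c, december_env)`, where `december_env` is a `CustomForex` built on `df2`, would give the out-of-sample reward and the environment's final `info` dict.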
dataset = pd.read_csv('/content/drive/MyDrive/pyfinance.csv')
dataset['date'] = pd.to_datetime(dataset['date'])
desired_month = 3 # For March
desired_year = 2017 # For 2017, change as needed
df = dataset[(dataset['date'].dt.month == desired_month) & (dataset['date'].dt.year == desired_year)]
df.set_index('date', inplace=True)
desired_month = 12
desired_year = 2017 # For 2017
df2 = dataset[(dataset['date'].dt.month == desired_month) & (dataset['date'].dt.year == desired_year)]
df2.set_index('date', inplace=True)
2.2.1 A2C¶
This model divides the problem into two main components: the actor, which decides which action to take, and the critic, which evaluates the action taken by the actor by computing the value function. When applied to high-frequency trading (HFT) in the forex market, the Actor-Critic model offers several notable advantages (Liu et al., 2022).
In the context of forex trading, the actor component of the model is responsible for deciding on trading actions. The critic assesses the actions taken by the actor by estimating the value function of the current policy.
window_size = 180
start_date = "2017-03-01"
end_date = "2017-03-31"
filtered_data_a2c = df[start_date:end_date]
start_index = window_size
end_index = len(filtered_data_a2c)
Environment¶
def my_process_data(env):
prices = env.df.loc[:, 'Close'].to_numpy()
prices = prices[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
    diff = np.insert(np.diff(prices), 0, 0)
try:
sma = env.df.loc[:, 'SMA'].to_numpy()
macd = env.df.loc[:, 'MACD_SIG'].to_numpy()
rsi = env.df.loc[:, 'RSI'].to_numpy()
sma = sma[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
macd = macd[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
rsi = rsi[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
signal_features = np.column_stack((prices, diff, sma, macd, rsi))
    except KeyError:
        print("(SMA, MACD, RSI) weren't all present. Reverting to default signal features (price and price-diff)")
        signal_features = np.column_stack((prices, diff))
return prices.astype(np.float32), signal_features.astype(np.float32)
class CustomForex(ForexEnv):
_process_data = my_process_data
my_custom_env = CustomForex(df=filtered_data_a2c, window_size=window_size, frame_bound=(start_index, end_index))
(SMA, MACD, RSI) weren't all present. Reverting to default signal features (price and price-diff)
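The fallback fired because the indicator columns were absent in this frame. For reference, when SMA, MACD_SIG, and RSI are present the stacking yields one row per bar with five features; a small synthetic check (column names are the ones the function looks for, values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic frame with the columns my_process_data expects.
df_demo = pd.DataFrame({
    "Close": np.linspace(1.05, 1.10, 6),
    "SMA": np.full(6, 1.07),
    "MACD_SIG": np.zeros(6),
    "RSI": np.full(6, 50.0),
})

prices = df_demo["Close"].to_numpy()
diff = np.insert(np.diff(prices), 0, 0)  # first diff is padded with 0
signal_features = np.column_stack(
    (prices, diff, df_demo["SMA"], df_demo["MACD_SIG"], df_demo["RSI"])
)
print(signal_features.shape)  # (6, 5): one row per bar, five features
```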
Model Training¶
The note below simply reflects Colab truncating the printed output: only the last 5000 lines are shown, while training logged more. Ignore it; the printing was only for observation purposes.
model_a2c = A2C("MlpPolicy", my_custom_env, verbose=1, seed = 2023)
model_a2c.learn(total_timesteps=500000)
Выходные данные были обрезаны до нескольких последних строк (5000).
| value_loss | 3.33e-18 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 70700 |
| time_elapsed | 807 |
| total_timesteps | 353500 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | -0.185 |
| learning_rate | 0.0007 |
| n_updates | 70699 |
| policy_loss | 2.91e-12 |
| value_loss | 3.4e-15 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 70800 |
| time_elapsed | 809 |
| total_timesteps | 354000 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 70799 |
| policy_loss | 2.43e-12 |
| value_loss | 1.4e-15 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 70900 |
| time_elapsed | 810 |
| total_timesteps | 354500 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 70899 |
| policy_loss | 8.9e-08 |
| value_loss | 2.07e-06 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 71000 |
| time_elapsed | 811 |
| total_timesteps | 355000 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | 0.00213 |
| learning_rate | 0.0007 |
| n_updates | 70999 |
| policy_loss | 3.48e-09 |
| value_loss | 3.17e-09 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 71100 |
| time_elapsed | 812 |
| total_timesteps | 355500 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 71099 |
| policy_loss | 8.19e-08 |
| value_loss | 1.76e-06 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 71200 |
| time_elapsed | 813 |
| total_timesteps | 356000 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | -3.47e-05 |
| learning_rate | 0.0007 |
| n_updates | 71199 |
| policy_loss | 7.13e-08 |
| value_loss | 1.33e-06 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 71300 |
| time_elapsed | 814 |
| total_timesteps | 356500 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | 5.39e-05 |
| learning_rate | 0.0007 |
| n_updates | 71299 |
| policy_loss | 4.59e-08 |
| value_loss | 5.52e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 71400 |
| time_elapsed | 815 |
| total_timesteps | 357000 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | -0.000138 |
| learning_rate | 0.0007 |
| n_updates | 71399 |
| policy_loss | 3.6e-08 |
| value_loss | 3.39e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 71500 |
| time_elapsed | 817 |
| total_timesteps | 357500 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | 0.000123 |
| learning_rate | 0.0007 |
| n_updates | 71499 |
| policy_loss | 4e-08 |
| value_loss | 4.19e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 437 |
| iterations | 71600 |
| time_elapsed | 818 |
| total_timesteps | 358000 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | -0.000209 |
| learning_rate | 0.0007 |
| n_updates | 71599 |
| policy_loss | 4.73e-08 |
| value_loss | 5.86e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 436 |
| iterations | 71700 |
| time_elapsed | 820 |
| total_timesteps | 358500 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 71699 |
| policy_loss | 4.71e-08 |
| value_loss | 5.8e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 436 |
| iterations | 71800 |
| time_elapsed | 822 |
| total_timesteps | 359000 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | -5.54e-05 |
| learning_rate | 0.0007 |
| n_updates | 71799 |
| policy_loss | 4.43e-08 |
| value_loss | 5.14e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 436 |
| iterations | 71900 |
| time_elapsed | 823 |
| total_timesteps | 359500 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | 0.000167 |
| learning_rate | 0.0007 |
| n_updates | 71899 |
| policy_loss | 4.42e-08 |
| value_loss | 5.1e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 436 |
| iterations | 72000 |
| time_elapsed | 824 |
| total_timesteps | 360000 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | 5.5e-05 |
| learning_rate | 0.0007 |
| n_updates | 71999 |
| policy_loss | 4.51e-08 |
| value_loss | 5.32e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 72100 |
| time_elapsed | 826 |
| total_timesteps | 360500 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | 5.47e-05 |
| learning_rate | 0.0007 |
| n_updates | 72099 |
| policy_loss | 4.53e-08 |
| value_loss | 5.37e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 72200 |
| time_elapsed | 828 |
| total_timesteps | 361000 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 72199 |
| policy_loss | 4.5e-08 |
| value_loss | 5.29e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 72300 |
| time_elapsed | 832 |
| total_timesteps | 361500 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | -0.00011 |
| learning_rate | 0.0007 |
| n_updates | 72299 |
| policy_loss | 4.49e-08 |
| value_loss | 5.27e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 72400 |
| time_elapsed | 833 |
| total_timesteps | 362000 |
| train/ | |
| entropy_loss | -0.000723 |
| explained_variance | -0.00011 |
| learning_rate | 0.0007 |
| n_updates | 72399 |
| policy_loss | 4.49e-08 |
| value_loss | 5.29e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 72500 |
| time_elapsed | 834 |
| total_timesteps | 362500 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 72499 |
| policy_loss | 4.5e-08 |
| value_loss | 5.3e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 72600 |
| time_elapsed | 835 |
| total_timesteps | 363000 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 72599 |
| policy_loss | 4.49e-08 |
| value_loss | 5.28e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 72700 |
| time_elapsed | 836 |
| total_timesteps | 363500 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | -0.000109 |
| learning_rate | 0.0007 |
| n_updates | 72699 |
| policy_loss | 4.49e-08 |
| value_loss | 5.28e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 72800 |
| time_elapsed | 837 |
| total_timesteps | 364000 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | 5.96e-08 |
| learning_rate | 0.0007 |
| n_updates | 72799 |
| policy_loss | 4.49e-08 |
| value_loss | 5.27e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 72900 |
| time_elapsed | 838 |
| total_timesteps | 364500 |
| train/ | |
| entropy_loss | -0.000722 |
| explained_variance | 0.000109 |
| learning_rate | 0.0007 |
| n_updates | 72899 |
| policy_loss | 4.49e-08 |
| value_loss | 5.28e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 73000 |
| time_elapsed | 839 |
| total_timesteps | 365000 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | -0.00129 |
| learning_rate | 0.0007 |
| n_updates | 72999 |
| policy_loss | -9.43e-09 |
| value_loss | 3.8e-09 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 434 |
| iterations | 73100 |
| time_elapsed | 840 |
| total_timesteps | 365500 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 73099 |
| policy_loss | 1.03e-11 |
| value_loss | 3.75e-15 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 73200 |
| time_elapsed | 841 |
| total_timesteps | 366000 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | 1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 73199 |
| policy_loss | 4.1e-12 |
| value_loss | 8.61e-16 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 73300 |
| time_elapsed | 842 |
| total_timesteps | 366500 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 73299 |
| policy_loss | 7.71e-12 |
| value_loss | 2.54e-15 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 73400 |
| time_elapsed | 843 |
| total_timesteps | 367000 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 73399 |
| policy_loss | 1.1e-11 |
| value_loss | 5.18e-15 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 73500 |
| time_elapsed | 844 |
| total_timesteps | 367500 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 73499 |
| policy_loss | 2.12e-09 |
| value_loss | 1.92e-10 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 73600 |
| time_elapsed | 845 |
| total_timesteps | 368000 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 73599 |
| policy_loss | 1.05e-10 |
| value_loss | 2.86e-12 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 73700 |
| time_elapsed | 846 |
| total_timesteps | 368500 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 73699 |
| policy_loss | 1.17e-13 |
| value_loss | 3.56e-18 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 73800 |
| time_elapsed | 847 |
| total_timesteps | 369000 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 73799 |
| policy_loss | 6.43e-13 |
| value_loss | 1.08e-16 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 73900 |
| time_elapsed | 848 |
| total_timesteps | 369500 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 73899 |
| policy_loss | 5.08e-14 |
| value_loss | 6.74e-19 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 74000 |
| time_elapsed | 849 |
| total_timesteps | 370000 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | -4.99 |
| learning_rate | 0.0007 |
| n_updates | 73999 |
| policy_loss | -1.42e-12 |
| value_loss | 2.09e-15 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 74100 |
| time_elapsed | 850 |
| total_timesteps | 370500 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 74099 |
| policy_loss | 4.09e-10 |
| value_loss | 4.39e-11 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 74200 |
| time_elapsed | 851 |
| total_timesteps | 371000 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | 3.05e-05 |
| learning_rate | 0.0007 |
| n_updates | 74199 |
| policy_loss | 7.99e-08 |
| value_loss | 1.67e-06 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 74300 |
| time_elapsed | 852 |
| total_timesteps | 371500 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | -3.58e-07 |
| learning_rate | 0.0007 |
| n_updates | 74299 |
| policy_loss | 7.57e-08 |
| value_loss | 1.5e-06 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 435 |
| iterations | 74400 |
| time_elapsed | 853 |
| total_timesteps | 372000 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 74399 |
| policy_loss | 6.94e-08 |
| value_loss | 1.26e-06 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 436 |
| iterations | 74500 |
| time_elapsed | 854 |
| total_timesteps | 372500 |
| train/ | |
| entropy_loss | -0.000721 |
| explained_variance | -0.000104 |
| learning_rate | 0.0007 |
| n_updates | 74499 |
| policy_loss | 4.72e-08 |
| value_loss | 5.84e-07 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 436 |
| iterations | 74600 |
| time_elapsed | 855 |
| total_timesteps | 373000 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | -65.1 |
| learning_rate | 0.0007 |
| n_updates | 74599 |
| policy_loss | 7.39e-12 |
| value_loss | 2.76e-15 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 436 |
| iterations | 74700 |
| time_elapsed | 856 |
| total_timesteps | 373500 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 74699 |
| policy_loss | -2.29e-13 |
| value_loss | 2.22e-18 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 147 |
| time/ | |
| fps | 436 |
| iterations | 74800 |
| time_elapsed | 857 |
| total_timesteps | 374000 |
| train/ | |
| entropy_loss | -0.00164 |
| explained_variance | -14 |
| learning_rate | 0.0007 |
| n_updates | 74799 |
| policy_loss | -7.62e-12 |
| value_loss | 3.46e-15 |
-------------------------------------
[... training log truncated: per-100-iteration A2C tables from iteration 74900 through 83400 omitted for brevity. Over this span fps rose from 436 to 442, ep_len_mean held at 3.13e+04, ep_rew_mean climbed from 147 to 154, and total_timesteps advanced from 374500 to 417000. entropy_loss stayed near zero throughout, and policy_loss / value_loss remained tiny (~1e-12 to ~1e-7) apart from brief spikes (e.g. policy_loss 82.8 and value_loss 2.5e+03 at iteration 82700), with explained_variance oscillating erratically between large negative values and ~0 ...]
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 83500 |
| time_elapsed | 944 |
| total_timesteps | 417500 |
| train/ | |
| entropy_loss | -0.00287 |
| explained_variance | 0.000229 |
| learning_rate | 0.0007 |
| n_updates | 83499 |
| policy_loss | -3.02e-07 |
| value_loss | 1.1e-06 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 83600 |
| time_elapsed | 945 |
| total_timesteps | 418000 |
| train/ | |
| entropy_loss | -0.00287 |
| explained_variance | 0.000184 |
| learning_rate | 0.0007 |
| n_updates | 83599 |
| policy_loss | -1.86e-07 |
| value_loss | 4.17e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 83700 |
| time_elapsed | 946 |
| total_timesteps | 418500 |
| train/ | |
| entropy_loss | -0.00287 |
| explained_variance | -0.000121 |
| learning_rate | 0.0007 |
| n_updates | 83699 |
| policy_loss | -2.84e-07 |
| value_loss | 9.78e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 83800 |
| time_elapsed | 947 |
| total_timesteps | 419000 |
| train/ | |
| entropy_loss | -0.00287 |
| explained_variance | 4.42e-05 |
| learning_rate | 0.0007 |
| n_updates | 83799 |
| policy_loss | -2.61e-07 |
| value_loss | 8.26e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 83900 |
| time_elapsed | 948 |
| total_timesteps | 419500 |
| train/ | |
| entropy_loss | -0.00287 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 83899 |
| policy_loss | -1.94e-07 |
| value_loss | 4.54e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 84000 |
| time_elapsed | 949 |
| total_timesteps | 420000 |
| train/ | |
| entropy_loss | -0.00287 |
| explained_variance | 6.09e-05 |
| learning_rate | 0.0007 |
| n_updates | 83999 |
| policy_loss | -1.88e-07 |
| value_loss | 4.26e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 84100 |
| time_elapsed | 950 |
| total_timesteps | 420500 |
| train/ | |
| entropy_loss | -0.0133 |
| explained_variance | 0.00078 |
| learning_rate | 0.0007 |
| n_updates | 84099 |
| policy_loss | -4.23e-07 |
| value_loss | 6.55e-08 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 84200 |
| time_elapsed | 950 |
| total_timesteps | 421000 |
| train/ | |
| entropy_loss | -0.0133 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 84199 |
| policy_loss | 5.47e-12 |
| value_loss | 1.1e-17 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 84300 |
| time_elapsed | 951 |
| total_timesteps | 421500 |
| train/ | |
| entropy_loss | -0.0133 |
| explained_variance | -2.98e+03 |
| learning_rate | 0.0007 |
| n_updates | 84299 |
| policy_loss | -2e-11 |
| value_loss | 1.36e-15 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 84400 |
| time_elapsed | 952 |
| total_timesteps | 422000 |
| train/ | |
| entropy_loss | -0.0133 |
| explained_variance | 0.166 |
| learning_rate | 0.0007 |
| n_updates | 84399 |
| policy_loss | 5.06e-11 |
| value_loss | 6.95e-16 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 84500 |
| time_elapsed | 953 |
| total_timesteps | 422500 |
| train/ | |
| entropy_loss | -0.0133 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 84499 |
| policy_loss | -2.79e-12 |
| value_loss | 2.85e-18 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 84600 |
| time_elapsed | 954 |
| total_timesteps | 423000 |
| train/ | |
| entropy_loss | -0.0387 |
| explained_variance | 2.85e-05 |
| learning_rate | 0.0007 |
| n_updates | 84599 |
| policy_loss | -1.64e-05 |
| value_loss | 7.9e-06 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 84700 |
| time_elapsed | 955 |
| total_timesteps | 423500 |
| train/ | |
| entropy_loss | -0.0971 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 84699 |
| policy_loss | 5.25 |
| value_loss | 34.1 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 84800 |
| time_elapsed | 957 |
| total_timesteps | 424000 |
| train/ | |
| entropy_loss | -0.193 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 84799 |
| policy_loss | 0.000275 |
| value_loss | 3.79e-05 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 442 |
| iterations | 84900 |
| time_elapsed | 958 |
| total_timesteps | 424500 |
| train/ | |
| entropy_loss | -0.505 |
| explained_variance | -1.26e-05 |
| learning_rate | 0.0007 |
| n_updates | 84899 |
| policy_loss | -0.00521 |
| value_loss | 0.000639 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85000 |
| time_elapsed | 959 |
| total_timesteps | 425000 |
| train/ | |
| entropy_loss | -0.484 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 84999 |
| policy_loss | 0.00133 |
| value_loss | 4.95e-05 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85100 |
| time_elapsed | 960 |
| total_timesteps | 425500 |
| train/ | |
| entropy_loss | -0.681 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 85099 |
| policy_loss | 0.666 |
| value_loss | 1.39 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85200 |
| time_elapsed | 961 |
| total_timesteps | 426000 |
| train/ | |
| entropy_loss | -0.621 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 85199 |
| policy_loss | -0.00659 |
| value_loss | 0.000378 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85300 |
| time_elapsed | 961 |
| total_timesteps | 426500 |
| train/ | |
| entropy_loss | -0.645 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 85299 |
| policy_loss | -1.11 |
| value_loss | 2.63 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85400 |
| time_elapsed | 962 |
| total_timesteps | 427000 |
| train/ | |
| entropy_loss | -0.658 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 85399 |
| policy_loss | 0.814 |
| value_loss | 2.88 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85500 |
| time_elapsed | 963 |
| total_timesteps | 427500 |
| train/ | |
| entropy_loss | -0.67 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 85499 |
| policy_loss | -1.63 |
| value_loss | 15.1 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85600 |
| time_elapsed | 964 |
| total_timesteps | 428000 |
| train/ | |
| entropy_loss | -0.548 |
| explained_variance | 2.98e-07 |
| learning_rate | 0.0007 |
| n_updates | 85599 |
| policy_loss | -0.00391 |
| value_loss | 0.00782 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85700 |
| time_elapsed | 965 |
| total_timesteps | 428500 |
| train/ | |
| entropy_loss | -0.233 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 85699 |
| policy_loss | 0.000664 |
| value_loss | 0.000131 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85800 |
| time_elapsed | 966 |
| total_timesteps | 429000 |
| train/ | |
| entropy_loss | -0.199 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 85799 |
| policy_loss | -3.09e-06 |
| value_loss | 4.38e-09 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 85900 |
| time_elapsed | 967 |
| total_timesteps | 429500 |
| train/ | |
| entropy_loss | -0.204 |
| explained_variance | 0.000136 |
| learning_rate | 0.0007 |
| n_updates | 85899 |
| policy_loss | -2.82e-05 |
| value_loss | 3.41e-07 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 86000 |
| time_elapsed | 968 |
| total_timesteps | 430000 |
| train/ | |
| entropy_loss | -0.248 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 85999 |
| policy_loss | 0.177 |
| value_loss | 0.0849 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 86100 |
| time_elapsed | 969 |
| total_timesteps | 430500 |
| train/ | |
| entropy_loss | -0.26 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 86099 |
| policy_loss | -3.45e-05 |
| value_loss | 2.57e-07 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 86200 |
| time_elapsed | 970 |
| total_timesteps | 431000 |
| train/ | |
| entropy_loss | -0.509 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 86199 |
| policy_loss | -1.14 |
| value_loss | 5.13 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 86300 |
| time_elapsed | 972 |
| total_timesteps | 431500 |
| train/ | |
| entropy_loss | -0.313 |
| explained_variance | 1.67e-05 |
| learning_rate | 0.0007 |
| n_updates | 86299 |
| policy_loss | -0.000651 |
| value_loss | 5.2e-05 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 86400 |
| time_elapsed | 973 |
| total_timesteps | 432000 |
| train/ | |
| entropy_loss | -0.284 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 86399 |
| policy_loss | 1.99e-05 |
| value_loss | 6.56e-08 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 443 |
| iterations | 86500 |
| time_elapsed | 974 |
| total_timesteps | 432500 |
| train/ | |
| entropy_loss | -0.349 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 86499 |
| policy_loss | 0.00364 |
| value_loss | 0.00691 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 86600 |
| time_elapsed | 975 |
| total_timesteps | 433000 |
| train/ | |
| entropy_loss | -0.391 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 86599 |
| policy_loss | 1.02 |
| value_loss | 3.67 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 86700 |
| time_elapsed | 976 |
| total_timesteps | 433500 |
| train/ | |
| entropy_loss | -0.305 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 86699 |
| policy_loss | 0.0589 |
| value_loss | 0.0233 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 86800 |
| time_elapsed | 976 |
| total_timesteps | 434000 |
| train/ | |
| entropy_loss | -0.196 |
| explained_variance | 1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 86799 |
| policy_loss | -7.72e-05 |
| value_loss | 2.85e-06 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 86900 |
| time_elapsed | 977 |
| total_timesteps | 434500 |
| train/ | |
| entropy_loss | -0.115 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 86899 |
| policy_loss | -8.14e-05 |
| value_loss | 1.31e-05 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 87000 |
| time_elapsed | 978 |
| total_timesteps | 435000 |
| train/ | |
| entropy_loss | -0.195 |
| explained_variance | 1.79e-07 |
| learning_rate | 0.0007 |
| n_updates | 86999 |
| policy_loss | 0.000102 |
| value_loss | 5.09e-06 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 87100 |
| time_elapsed | 979 |
| total_timesteps | 435500 |
| train/ | |
| entropy_loss | -0.259 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 87099 |
| policy_loss | 0.000534 |
| value_loss | 6.18e-05 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 87200 |
| time_elapsed | 980 |
| total_timesteps | 436000 |
| train/ | |
| entropy_loss | -0.137 |
| explained_variance | -2.38e-07 |
| learning_rate | 0.0007 |
| n_updates | 87199 |
| policy_loss | -1.57 |
| value_loss | 3.75 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 87300 |
| time_elapsed | 981 |
| total_timesteps | 436500 |
| train/ | |
| entropy_loss | -0.156 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 87299 |
| policy_loss | 0.000135 |
| value_loss | 1.61e-05 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 87400 |
| time_elapsed | 982 |
| total_timesteps | 437000 |
| train/ | |
| entropy_loss | -0.0894 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 87399 |
| policy_loss | 2.45e-05 |
| value_loss | 2.27e-06 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 87500 |
| time_elapsed | 983 |
| total_timesteps | 437500 |
| train/ | |
| entropy_loss | -0.0593 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 87499 |
| policy_loss | 1.38e-05 |
| value_loss | 2.01e-06 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 87600 |
| time_elapsed | 985 |
| total_timesteps | 438000 |
| train/ | |
| entropy_loss | -0.0674 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 87599 |
| policy_loss | -6.5e-05 |
| value_loss | 3.23e-05 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 154 |
| time/ | |
| fps | 444 |
| iterations | 87700 |
| time_elapsed | 986 |
| total_timesteps | 438500 |
| train/ | |
| entropy_loss | -0.0309 |
| explained_variance | 4.86e-05 |
| learning_rate | 0.0007 |
| n_updates | 87699 |
| policy_loss | 1.09e-05 |
| value_loss | 6.04e-06 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 444 |
| iterations | 87800 |
| time_elapsed | 987 |
| total_timesteps | 439000 |
| train/ | |
| entropy_loss | -0.0172 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 87799 |
| policy_loss | 7.63e-09 |
| value_loss | 1.17e-11 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 444 |
| iterations | 87900 |
| time_elapsed | 988 |
| total_timesteps | 439500 |
| train/ | |
| entropy_loss | -0.0161 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 87899 |
| policy_loss | -1.21e-07 |
| value_loss | 3.42e-09 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 444 |
| iterations | 88000 |
| time_elapsed | 989 |
| total_timesteps | 440000 |
| train/ | |
| entropy_loss | -0.0161 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 87999 |
| policy_loss | -2.81e-11 |
| value_loss | 1.86e-16 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 444 |
| iterations | 88100 |
| time_elapsed | 990 |
| total_timesteps | 440500 |
| train/ | |
| entropy_loss | -0.0116 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 88099 |
| policy_loss | 3.03e-11 |
| value_loss | 4.64e-16 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 444 |
| iterations | 88200 |
| time_elapsed | 991 |
| total_timesteps | 441000 |
| train/ | |
| entropy_loss | -0.0116 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 88199 |
| policy_loss | -1.02e-11 |
| value_loss | 5.24e-17 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 88300 |
| time_elapsed | 992 |
| total_timesteps | 441500 |
| train/ | |
| entropy_loss | -0.0119 |
| explained_variance | -0.000309 |
| learning_rate | 0.0007 |
| n_updates | 88299 |
| policy_loss | -3.75e-07 |
| value_loss | 6.66e-08 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 88400 |
| time_elapsed | 992 |
| total_timesteps | 442000 |
| train/ | |
| entropy_loss | -0.0119 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 88399 |
| policy_loss | 1.89e-11 |
| value_loss | 1.7e-16 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 88500 |
| time_elapsed | 993 |
| total_timesteps | 442500 |
| train/ | |
| entropy_loss | -0.0397 |
| explained_variance | 0.013 |
| learning_rate | 0.0007 |
| n_updates | 88499 |
| policy_loss | -1.72e-08 |
| value_loss | 8.11e-12 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 88600 |
| time_elapsed | 994 |
| total_timesteps | 443000 |
| train/ | |
| entropy_loss | -0.0556 |
| explained_variance | -0.553 |
| learning_rate | 0.0007 |
| n_updates | 88599 |
| policy_loss | -1.36e-09 |
| value_loss | 2.86e-14 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 88700 |
| time_elapsed | 995 |
| total_timesteps | 443500 |
| train/ | |
| entropy_loss | -0.0336 |
| explained_variance | -0.000958 |
| learning_rate | 0.0007 |
| n_updates | 88699 |
| policy_loss | -6.15e-07 |
| value_loss | 1.57e-08 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 88800 |
| time_elapsed | 997 |
| total_timesteps | 444000 |
| train/ | |
| entropy_loss | -0.0243 |
| explained_variance | -0.536 |
| learning_rate | 0.0007 |
| n_updates | 88799 |
| policy_loss | 4.3e-10 |
| value_loss | 2.31e-14 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 88900 |
| time_elapsed | 998 |
| total_timesteps | 444500 |
| train/ | |
| entropy_loss | -0.0243 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 88899 |
| policy_loss | 1.82e-11 |
| value_loss | 2.95e-17 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89000 |
| time_elapsed | 999 |
| total_timesteps | 445000 |
| train/ | |
| entropy_loss | -0.0164 |
| explained_variance | -99.8 |
| learning_rate | 0.0007 |
| n_updates | 88999 |
| policy_loss | -1.24e-10 |
| value_loss | 3.49e-15 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89100 |
| time_elapsed | 1000 |
| total_timesteps | 445500 |
| train/ | |
| entropy_loss | -0.0144 |
| explained_variance | 5.96e-08 |
| learning_rate | 0.0007 |
| n_updates | 89099 |
| policy_loss | 1.1e-11 |
| value_loss | 3.68e-17 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89200 |
| time_elapsed | 1001 |
| total_timesteps | 446000 |
| train/ | |
| entropy_loss | -0.0144 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 89199 |
| policy_loss | 1.47e-11 |
| value_loss | 6.57e-17 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89300 |
| time_elapsed | 1002 |
| total_timesteps | 446500 |
| train/ | |
| entropy_loss | -0.00828 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0007 |
| n_updates | 89299 |
| policy_loss | -1.81e-11 |
| value_loss | 3.58e-16 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89400 |
| time_elapsed | 1003 |
| total_timesteps | 447000 |
| train/ | |
| entropy_loss | -0.00411 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 89399 |
| policy_loss | 5.73e-12 |
| value_loss | 1.77e-16 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89500 |
| time_elapsed | 1004 |
| total_timesteps | 447500 |
| train/ | |
| entropy_loss | -0.00411 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 89499 |
| policy_loss | -6.82e-13 |
| value_loss | 6.56e-16 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89600 |
| time_elapsed | 1005 |
| total_timesteps | 448000 |
| train/ | |
| entropy_loss | -0.00772 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 89599 |
| policy_loss | -5.49e-11 |
| value_loss | 2.53e-15 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89700 |
| time_elapsed | 1006 |
| total_timesteps | 448500 |
| train/ | |
| entropy_loss | -0.00772 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 89699 |
| policy_loss | 2.74e-12 |
| value_loss | 9.62e-18 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89800 |
| time_elapsed | 1007 |
| total_timesteps | 449000 |
| train/ | |
| entropy_loss | -0.0138 |
| explained_variance | -0.574 |
| learning_rate | 0.0007 |
| n_updates | 89799 |
| policy_loss | -3.17e-10 |
| value_loss | 3.6e-14 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 89900 |
| time_elapsed | 1008 |
| total_timesteps | 449500 |
| train/ | |
| entropy_loss | -0.0138 |
| explained_variance | -970 |
| learning_rate | 0.0007 |
| n_updates | 89899 |
| policy_loss | 1.2e-10 |
| value_loss | 4.73e-15 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 90000 |
| time_elapsed | 1009 |
| total_timesteps | 450000 |
| train/ | |
| entropy_loss | -0.0375 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 89999 |
| policy_loss | -9.95e-07 |
| value_loss | 3.15e-08 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 90100 |
| time_elapsed | 1010 |
| total_timesteps | 450500 |
| train/ | |
| entropy_loss | -0.0375 |
| explained_variance | -103 |
| learning_rate | 0.0007 |
| n_updates | 90099 |
| policy_loss | -1.07e-10 |
| value_loss | 3.54e-17 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 90200 |
| time_elapsed | 1011 |
| total_timesteps | 451000 |
| train/ | |
| entropy_loss | -0.0272 |
| explained_variance | 0.0147 |
| learning_rate | 0.0007 |
| n_updates | 90199 |
| policy_loss | 3.09e-08 |
| value_loss | 6.53e-11 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 152 |
| time/ | |
| fps | 445 |
| iterations | 90300 |
| time_elapsed | 1012 |
| total_timesteps | 451500 |
| train/ | |
| entropy_loss | -0.0184 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 90299 |
| policy_loss | 3.12e-08 |
| value_loss | 1.68e-10 |
------------------------------------
[... ~90 near-identical A2C logger tables omitted (iterations 90400–98900, total_timesteps 452000–494500); over this span ep_rew_mean rose from 152 to 168 while ep_len_mean held at 3.13e+04 ...]
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 449 |
| iterations | 99000 |
| time_elapsed | 1100 |
| total_timesteps | 495000 |
| train/ | |
| entropy_loss | -0.043 |
| explained_variance | 0.014 |
| learning_rate | 0.0007 |
| n_updates | 98999 |
| policy_loss | -3.73e-08 |
| value_loss | 3.17e-11 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99100 |
| time_elapsed | 1100 |
| total_timesteps | 495500 |
| train/ | |
| entropy_loss | -0.0516 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 99099 |
| policy_loss | 4.01e-06 |
| value_loss | 2.37e-07 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99200 |
| time_elapsed | 1101 |
| total_timesteps | 496000 |
| train/ | |
| entropy_loss | -0.0165 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 99199 |
| policy_loss | 4.7e-06 |
| value_loss | 4.89e-06 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99300 |
| time_elapsed | 1102 |
| total_timesteps | 496500 |
| train/ | |
| entropy_loss | -0.0165 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 99299 |
| policy_loss | -1.39e-12 |
| value_loss | 4.28e-19 |
-------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99400 |
| time_elapsed | 1103 |
| total_timesteps | 497000 |
| train/ | |
| entropy_loss | -0.0169 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 99399 |
| policy_loss | 1.9e-05 |
| value_loss | 7.52e-05 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99500 |
| time_elapsed | 1104 |
| total_timesteps | 497500 |
| train/ | |
| entropy_loss | -0.037 |
| explained_variance | -0.00043 |
| learning_rate | 0.0007 |
| n_updates | 99499 |
| policy_loss | 1.02e-06 |
| value_loss | 3.41e-08 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99600 |
| time_elapsed | 1105 |
| total_timesteps | 498000 |
| train/ | |
| entropy_loss | -0.0132 |
| explained_variance | 0.00032 |
| learning_rate | 0.0007 |
| n_updates | 99599 |
| policy_loss | 2.07e-07 |
| value_loss | 1.6e-08 |
------------------------------------
------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99700 |
| time_elapsed | 1106 |
| total_timesteps | 498500 |
| train/ | |
| entropy_loss | -0.00807 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 99699 |
| policy_loss | 7.33e-10 |
| value_loss | 6.23e-13 |
------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99800 |
| time_elapsed | 1107 |
| total_timesteps | 499000 |
| train/ | |
| entropy_loss | -0.0131 |
| explained_variance | 0 |
| learning_rate | 0.0007 |
| n_updates | 99799 |
| policy_loss | -8.55e-06 |
| value_loss | 2.8e-05 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 99900 |
| time_elapsed | 1108 |
| total_timesteps | 499500 |
| train/ | |
| entropy_loss | -0.0131 |
| explained_variance | 1.79e-07 |
| learning_rate | 0.0007 |
| n_updates | 99899 |
| policy_loss | -1.65e-11 |
| value_loss | 1.04e-16 |
-------------------------------------
-------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | 168 |
| time/ | |
| fps | 450 |
| iterations | 100000 |
| time_elapsed | 1109 |
| total_timesteps | 500000 |
| train/ | |
| entropy_loss | -0.00618 |
| explained_variance | 0.000335 |
| learning_rate | 0.0007 |
| n_updates | 99999 |
| policy_loss | -1.62e-07 |
| value_loss | 5.63e-08 |
-------------------------------------
<stable_baselines3.a2c.a2c.A2C at 0x79ae39e2bd30>
action_stats = {Actions.Sell: 0, Actions.Buy: 0}
observation, info = my_custom_env.reset()

while True:
    action, _states = model_a2c.predict(observation)
    action_stats[Actions(action)] += 1

    observation, reward, terminated, truncated, info = my_custom_env.step(action)
    done = terminated or truncated
    if done:
        break

my_custom_env.close()

print("action_stats:", action_stats)
print("info:", info)
action_stats: {<Actions.Sell: 0>: 26, <Actions.Buy: 1>: 31300}
info: {'total_reward': 135.59937477111816, 'total_profit': 0.9932310359073705, 'position': <Positions.Long: 1>}
plt.figure(figsize=(15,6))
plt.cla()
my_custom_env.render_all()
plt.show()
Profit and Reward¶
plt.figure(figsize=(15, 6))
plt.plot(my_custom_env.history['total_profit'], label='Total profit', color='gray', alpha=0.5)
[<matplotlib.lines.Line2D at 0x79ae5d385de0>]
plt.figure(figsize=(15, 6))
plt.plot(my_custom_env.history['total_reward'], label='Total reward', color='gray', alpha=0.5)
[<matplotlib.lines.Line2D at 0x79ae5d3f5c60>]
Test data¶
window_size = 180  # looking back 180 minutes (3 hours)
start_date = "2017-12-01"
end_date = "2017-12-31"
filtered_data_a2c_test = df2[start_date:end_date]
start_index = window_size
end_index = len(filtered_data_a2c_test)
test_env = CustomForex(df=filtered_data_a2c_test, window_size=window_size, frame_bound=(start_index, end_index))
(SMA, MACD, RSI) weren't all present. Reverting to default signal features (price and price-diff)
action_stats_a2c_test = {Actions.Sell: 0, Actions.Buy: 0}
observation, info = test_env.reset(seed=seed)

while True:
    action, _states = model_a2c.predict(observation)
    action_stats_a2c_test[Actions(action)] += 1

    observation, reward, terminated, truncated, info = test_env.step(action)
    done = terminated or truncated
    if done:
        break

test_env.close()

print("action_stats:", action_stats_a2c_test)
print("info:", info)
action_stats: {<Actions.Sell: 0>: 21, <Actions.Buy: 1>: 26888}
info: {'total_reward': 91.00079536437988, 'total_profit': 0.9950589185376948, 'position': <Positions.Long: 1>}
plt.figure(figsize=(15,6))
plt.cla()
test_env.render_all()
plt.show()
Profit and reward test¶
print('PROFIT')
plt.figure(figsize=(15, 6))
plt.plot(test_env.history['total_profit'], label='Total profit', color='gray', alpha=0.5)
PROFIT
[<matplotlib.lines.Line2D at 0x79ae5d08a1a0>]
print('REWARD')
plt.figure(figsize=(15, 6))
plt.plot(test_env.history['total_reward'], label='Total reward', color='gray', alpha=0.5)
REWARD
[<matplotlib.lines.Line2D at 0x79ae39204520>]
A2C Results¶
When fitting the A2C model to the EUR/USD March data, the reward is only occasionally negative and is updated in rather small increments. It finally reaches 135.6, showing that the agent is able to learn.
The trading agent performs few sell operations (26), preferring to buy and hold (31,300 buy actions), which makes the A2C strategy similar to the one obtained with PPO. When A2C is trained for fewer timesteps, total_profit falls drastically. total_profit starts at 1, and the final value shows how much the trader gained or lost. On the training set the final total_profit is 0.993, meaning the agent did not make money but lost only 0.007, i.e. under 1%. If the model is trained for fewer steps, the final total_profit drops to 0.74, a fall of 0.26, far more than 1%. The conclusion is that A2C should be trained for longer than PPO to give the model more stability.
On the test set, which exhibits new market behaviour in December, the total reward drops to 91 and only turns positive at the very end; it is lower than on the training set. Compared with A2C, PPO stabilizes its rewards faster, which is why we propose using more training steps for A2C when possible. Even though the December market follows a different pattern, the A2C model still pursues a rather cautious strategy with only 21 sell actions (meaning it is either out of position or has just sold an asset), spending most of the time buying or holding while waiting for a proper moment. This makes sense: December shows a new configuration of price movements, making the model more cautious about its actions.
On the test set the profit is managed rather well, with only a small proportion of money lost (0.005), even though the model was originally fitted to a month with different market behaviour.
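Since gym_anytrading initializes total_profit at 1.0, the percentage interpretation used above is simply the deviation from that starting value. A minimal sketch (the helper name is our own, for illustration):

```python
def profit_to_pct_return(total_profit):
    # gym_anytrading starts total_profit at 1.0, so the percentage
    # return is the deviation from that starting value
    return (total_profit - 1.0) * 100

# full training run: total_profit of 0.993 is a loss of about 0.7%
print(round(profit_to_pct_return(0.993), 1))  # -0.7
# shorter training run: 0.74 is a loss of about 26%
print(round(profit_to_pct_return(0.74), 1))   # -26.0
```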
2.2.2 PPO¶
The differences in penalties between short and long positions serve a crucial purpose: they incentivize trading activity and prevent risk-averse behaviour by the agent. In more detail:
Encouraging Trading Activity:
- It is important for traders to actively engage in trading to capitalize on profitable opportunities
- Penalties for negative profits, especially when a trade results in a loss, play a vital role in discouraging risk-averse behaviour in which the agent refrains from acting, or maintains a neutral position, to avoid potential losses
Preventing Risk-Free Behavior:
- Without appropriate penalties, the agent may learn to adopt risk-free behaviour and avoid trades altogether
- For instance, if profit is updated only when the position changes, the agent might learn that constantly maintaining a position without executing trades (e.g., always selling without changing position) yields a constant profit in gym_anytrading
Differentiating Penalties for Short and Long Positions:
- Differentiating the penalties for short and long positions reflects the different risk profiles of the two types of trade
- The penalty for being out of the market entirely and the penalty for holding assets without trading liquidity differ in nature
- gym_anytrading does not allow short selling, which is why there is no separate penalty for holding a position with a liability
In conclusion, the differentiated penalties for short and long positions discourage risk-averse behaviour, promote trading activity, and foster balanced risk management practices
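The shaping logic described above can be sketched in isolation from the environment. This is a simplified illustration: the function name and the trade_fee default are our own, while the penalty and premium values mirror those used in this notebook's custom reward.

```python
def shaped_reward(profit_pips, position, trade_fee=0.0003):
    """Reward for closing a trade, with asymmetric loss penalties."""
    # transaction cost scales with the absolute size of the move
    reward = profit_pips - abs(profit_pips) * trade_fee
    if profit_pips < 0:
        # a loss closed from a short position is penalized more harshly
        reward -= 10 if position == "short" else 7
    elif profit_pips > 0:
        reward += 7  # flat premium for any winning trade
    return reward

# a 5-pip loss hurts more when closing a short than a long
print(shaped_reward(-5, "short"))  # about -15.0015
print(shaped_reward(-5, "long"))   # about -12.0015
```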
window_size = 180
start_date = "2017-03-01"
end_date = "2017-03-31"
filtered_data = df[start_date:end_date]
start_index = window_size
end_index = len(filtered_data)
Modified gym_anytrading environment¶
def my_process_data(env):
    prices = env.df.loc[:, 'Close'].to_numpy()  # use close price as current share price
    prices = prices[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
    diff = np.insert(np.diff(prices), 0, 0)

    # additional indicators
    try:
        sma = env.df.loc[:, 'SMA'].to_numpy()
        macd = env.df.loc[:, 'MACD_SIG'].to_numpy()
        rsi = env.df.loc[:, 'RSI'].to_numpy()

        sma = sma[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
        macd = macd[env.frame_bound[0]-env.window_size:env.frame_bound[1]]
        rsi = rsi[env.frame_bound[0]-env.window_size:env.frame_bound[1]]

        signal_features = np.column_stack((prices, diff, sma, macd, rsi))
    except KeyError:
        print("(SMA, MACD, RSI) weren't all present. Reverting to default signal features (price and price-diff)")
        signal_features = np.column_stack((prices, diff))

    return prices.astype(np.float32), signal_features.astype(np.float32)
class MyStocksEnv(ForexEnv):
    _process_data = my_process_data

    def __init__(self, df, window_size, frame_bound):
        super().__init__(df, window_size, frame_bound)
        self.trade_log = []

    def step(self, action):
        self._truncated = False
        self._current_tick += 1

        if self._current_tick == self._end_tick:
            self._truncated = True

        step_reward = self._calculate_reward(action)
        self._total_reward += step_reward

        self._update_profit(action)

        trade = False
        if (
            (action == Actions.Buy.value and self._position == Positions.Short) or
            (action == Actions.Sell.value and self._position == Positions.Long)
        ):
            trade = True

        if trade:
            # Log the trade details
            trade_info = {
                'step': self._current_tick,
                'action': action,
                'price': self.prices[self._current_tick],
                'from position': self._position
                #'profit': profit,
            }
            self.trade_log.append(trade_info)

            self._position = self._position.opposite()
            self._last_trade_tick = self._current_tick

        #self.action_stats[Actions(action)] += 1
        self._position_history.append(self._position)
        observation = self._get_observation()
        info = self._get_info()
        self._update_history(info)

        if self.render_mode == 'human':
            self._render_frame()

        return observation, step_reward, False, self._truncated, info

    def _calculate_reward(self, action):
        step_reward = 0  # in pips

        trade = False
        if (
            (action == Actions.Buy.value and self._position == Positions.Short) or
            (action == Actions.Sell.value and self._position == Positions.Long)
        ):
            trade = True

        if trade:
            current_price = self.prices[self._current_tick]
            last_trade_price = self.prices[self._last_trade_tick]
            price_diff = current_price - last_trade_price

            # Calculate profit in pips
            profit = price_diff * 10000

            if self._position == Positions.Short:
                # Deduct transaction costs for short positions
                transaction_cost = abs(profit) * self.trade_fee
                step_reward += profit - transaction_cost

                # Penalize negative profits further
                if profit < 0:
                    penalty_for_loss = 10  # to identify
                    step_reward -= penalty_for_loss
                if profit > 0:
                    win_premium = 7
                    step_reward += win_premium

            elif self._position == Positions.Long:
                # Deduct transaction costs for long positions
                transaction_cost = abs(profit) * self.trade_fee
                step_reward += profit - transaction_cost

                # Penalize negative profits further
                if profit < 0:
                    penalty_for_loss = 7  # to identify
                    step_reward -= penalty_for_loss
                if profit > 0:
                    win_premium = 7
                    step_reward += win_premium

        return step_reward
Model Training¶
my_forex_env = MyStocksEnv(df=filtered_data, window_size=window_size, frame_bound=(start_index, end_index))
model = PPO("MlpPolicy", my_forex_env, verbose=1, learning_rate = 0.0001, seed= seed)
model.learn(total_timesteps=500000)
model.save("ppo_forex")
# the yellow Colab note below says the output was truncated to the last 5000 lines
Output was truncated to the last few lines (5000).
| entropy_loss | -0.691 |
| explained_variance | 3.58e-07 |
| learning_rate | 0.0001 |
| loss | 123 |
| n_updates | 50 |
| policy_gradient_loss | -0.00523 |
| value_loss | 280 |
-----------------------------------------
------------------------------------------
| time/ | |
| fps | 411 |
| iterations | 7 |
| time_elapsed | 34 |
| total_timesteps | 14336 |
| train/ | |
| approx_kl | 0.0049569034 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.692 |
| explained_variance | 7.15e-07 |
| learning_rate | 0.0001 |
| loss | 155 |
| n_updates | 60 |
| policy_gradient_loss | -0.00101 |
| value_loss | 301 |
------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.11e+04 |
| time/ | |
| fps | 478 |
| iterations | 16 |
| time_elapsed | 68 |
| total_timesteps | 32768 |
| train/ | |
| approx_kl | 0.0017815977 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.689 |
| explained_variance | 0 |
| learning_rate | 0.0001 |
| loss | 145 |
| n_updates | 150 |
| policy_gradient_loss | -0.000994 |
| value_loss | 312 |
------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -8.62e+03 |
| time/ | |
| fps | 507 |
| iterations | 31 |
| time_elapsed | 125 |
| total_timesteps | 63488 |
| train/ | |
| approx_kl | 0.0013099958 |
| clip_fraction | 0.00405 |
| clip_range | 0.2 |
| entropy_loss | -0.185 |
| explained_variance | -1.88e-05 |
| learning_rate | 0.0001 |
| loss | 39.3 |
| n_updates | 300 |
| policy_gradient_loss | -0.000805 |
| value_loss | 82.3 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -8.62e+03 |
| time/ | |
| fps | 510 |
| iterations | 40 |
| time_elapsed | 160 |
| total_timesteps | 81920 |
| train/ | |
| approx_kl | 0.00020054713 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.103 |
| explained_variance | 0.000197 |
| learning_rate | 0.0001 |
| loss | 6.67 |
| n_updates | 390 |
| policy_gradient_loss | -0.00018 |
| value_loss | 49 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -8.62e+03 |
| time/ | |
| fps | 512 |
| iterations | 41 |
| time_elapsed | 163 |
| total_timesteps | 83968 |
| train/ | |
| approx_kl | 0.00045890105 |
| clip_fraction | 0.000928 |
| clip_range | 0.2 |
| entropy_loss | -0.0922 |
| explained_variance | 0.000226 |
| learning_rate | 0.0001 |
| loss | 23.8 |
| n_updates | 400 |
| policy_gradient_loss | 0.000228 |
| value_loss | 42.8 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -8.62e+03 |
| time/ | |
| fps | 514 |
| iterations | 42 |
| time_elapsed | 167 |
| total_timesteps | 86016 |
| train/ | |
| approx_kl | 0.00027388014 |
| clip_fraction | 0.00347 |
| clip_range | 0.2 |
| entropy_loss | -0.093 |
| explained_variance | 0.000348 |
| learning_rate | 0.0001 |
| loss | 25.7 |
| n_updates | 410 |
| policy_gradient_loss | -0.000332 |
| value_loss | 32.2 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -8.62e+03 |
| time/ | |
| fps | 514 |
| iterations | 43 |
| time_elapsed | 171 |
| total_timesteps | 88064 |
| train/ | |
| approx_kl | 0.00026889332 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.1 |
| explained_variance | -0.000166 |
| learning_rate | 0.0001 |
| loss | 9.75 |
| n_updates | 420 |
| policy_gradient_loss | 0.000209 |
| value_loss | 29.1 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -8.62e+03 |
| time/ | |
| fps | 514 |
| iterations | 44 |
| time_elapsed | 175 |
| total_timesteps | 90112 |
| train/ | |
| approx_kl | 0.0003086823 |
| clip_fraction | 0.00542 |
| clip_range | 0.2 |
| entropy_loss | -0.12 |
| explained_variance | -0.000659 |
| learning_rate | 0.0001 |
| loss | 21.4 |
| n_updates | 430 |
| policy_gradient_loss | -0.00149 |
| value_loss | 54 |
------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -8.62e+03 |
| time/ | |
| fps | 515 |
| iterations | 45 |
| time_elapsed | 178 |
| total_timesteps | 92160 |
| train/ | |
| approx_kl | 0.0004820212 |
| clip_fraction | 0.00928 |
| clip_range | 0.2 |
| entropy_loss | -0.105 |
| explained_variance | 0.000334 |
| learning_rate | 0.0001 |
| loss | 23.9 |
| n_updates | 440 |
| policy_gradient_loss | -0.00161 |
| value_loss | 40.5 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 517 |
| iterations | 46 |
| time_elapsed | 182 |
| total_timesteps | 94208 |
| train/ | |
| approx_kl | 0.00044686458 |
| clip_fraction | 0.00356 |
| clip_range | 0.2 |
| entropy_loss | -0.0928 |
| explained_variance | 0.000714 |
| learning_rate | 0.0001 |
| loss | 15.9 |
| n_updates | 450 |
| policy_gradient_loss | -0.00116 |
| value_loss | 37.5 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 517 |
| iterations | 47 |
| time_elapsed | 186 |
| total_timesteps | 96256 |
| train/ | |
| approx_kl | 0.00030986342 |
| clip_fraction | 0.00186 |
| clip_range | 0.2 |
| entropy_loss | -0.077 |
| explained_variance | -0.000211 |
| learning_rate | 0.0001 |
| loss | 69.5 |
| n_updates | 460 |
| policy_gradient_loss | -0.000178 |
| value_loss | 48.1 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 517 |
| iterations | 48 |
| time_elapsed | 189 |
| total_timesteps | 98304 |
| train/ | |
| approx_kl | 0.0006204129 |
| clip_fraction | 0.00479 |
| clip_range | 0.2 |
| entropy_loss | -0.065 |
| explained_variance | 9.89e-05 |
| learning_rate | 0.0001 |
| loss | 11.3 |
| n_updates | 470 |
| policy_gradient_loss | -0.0015 |
| value_loss | 30.9 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 519 |
| iterations | 49 |
| time_elapsed | 193 |
| total_timesteps | 100352 |
| train/ | |
| approx_kl | 0.00022342062 |
| clip_fraction | 0.00303 |
| clip_range | 0.2 |
| entropy_loss | -0.0663 |
| explained_variance | -0.000208 |
| learning_rate | 0.0001 |
| loss | 8.59 |
| n_updates | 480 |
| policy_gradient_loss | -0.000975 |
| value_loss | 42.8 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 520 |
| iterations | 50 |
| time_elapsed | 196 |
| total_timesteps | 102400 |
| train/ | |
| approx_kl | 0.00016291966 |
| clip_fraction | 0.000684 |
| clip_range | 0.2 |
| entropy_loss | -0.0786 |
| explained_variance | 0.000286 |
| learning_rate | 0.0001 |
| loss | 19.7 |
| n_updates | 490 |
| policy_gradient_loss | -2.81e-05 |
| value_loss | 33.2 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 519 |
| iterations | 51 |
| time_elapsed | 200 |
| total_timesteps | 104448 |
| train/ | |
| approx_kl | 0.00041238018 |
| clip_fraction | 0.0106 |
| clip_range | 0.2 |
| entropy_loss | -0.0662 |
| explained_variance | 0.000111 |
| learning_rate | 0.0001 |
| loss | 14.8 |
| n_updates | 500 |
| policy_gradient_loss | -0.00318 |
| value_loss | 17.9 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 518 |
| iterations | 52 |
| time_elapsed | 205 |
| total_timesteps | 106496 |
| train/ | |
| approx_kl | 0.00019490853 |
| clip_fraction | 0.00586 |
| clip_range | 0.2 |
| entropy_loss | -0.0762 |
| explained_variance | 5.42e-05 |
| learning_rate | 0.0001 |
| loss | 6.61 |
| n_updates | 510 |
| policy_gradient_loss | -0.00189 |
| value_loss | 19.3 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 519 |
| iterations | 53 |
| time_elapsed | 208 |
| total_timesteps | 108544 |
| train/ | |
| approx_kl | 0.00016265517 |
| clip_fraction | 0.000293 |
| clip_range | 0.2 |
| entropy_loss | -0.0751 |
| explained_variance | 0.000146 |
| learning_rate | 0.0001 |
| loss | 13.7 |
| n_updates | 520 |
| policy_gradient_loss | -7.25e-05 |
| value_loss | 29.7 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 520 |
| iterations | 54 |
| time_elapsed | 212 |
| total_timesteps | 110592 |
| train/ | |
| approx_kl | 0.00035145192 |
| clip_fraction | 0.00503 |
| clip_range | 0.2 |
| entropy_loss | -0.0596 |
| explained_variance | 0.000102 |
| learning_rate | 0.0001 |
| loss | 13.4 |
| n_updates | 530 |
| policy_gradient_loss | -0.00142 |
| value_loss | 30.8 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 519 |
| iterations | 55 |
| time_elapsed | 216 |
| total_timesteps | 112640 |
| train/ | |
| approx_kl | 0.0001811744 |
| clip_fraction | 0.002 |
| clip_range | 0.2 |
| entropy_loss | -0.0638 |
| explained_variance | 0.000245 |
| learning_rate | 0.0001 |
| loss | 34.1 |
| n_updates | 540 |
| policy_gradient_loss | -7.61e-05 |
| value_loss | 76.5 |
------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 521 |
| iterations | 56 |
| time_elapsed | 219 |
| total_timesteps | 114688 |
| train/ | |
| approx_kl | 2.015548e-05 |
| clip_fraction | 0.00112 |
| clip_range | 0.2 |
| entropy_loss | -0.0708 |
| explained_variance | 0.000174 |
| learning_rate | 0.0001 |
| loss | 19.7 |
| n_updates | 550 |
| policy_gradient_loss | 0.000191 |
| value_loss | 24.7 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 522 |
| iterations | 57 |
| time_elapsed | 223 |
| total_timesteps | 116736 |
| train/ | |
| approx_kl | 0.00022369638 |
| clip_fraction | 0.00483 |
| clip_range | 0.2 |
| entropy_loss | -0.0798 |
| explained_variance | 8.94e-06 |
| learning_rate | 0.0001 |
| loss | 8.14 |
| n_updates | 560 |
| policy_gradient_loss | -0.000364 |
| value_loss | 46.3 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 522 |
| iterations | 58 |
| time_elapsed | 227 |
| total_timesteps | 118784 |
| train/ | |
| approx_kl | 0.00021125973 |
| clip_fraction | 0.000879 |
| clip_range | 0.2 |
| entropy_loss | -0.0837 |
| explained_variance | 4.13e-05 |
| learning_rate | 0.0001 |
| loss | 15.8 |
| n_updates | 570 |
| policy_gradient_loss | 9.68e-05 |
| value_loss | 25.7 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 522 |
| iterations | 59 |
| time_elapsed | 231 |
| total_timesteps | 120832 |
| train/ | |
| approx_kl | 0.00018066258 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.0657 |
| explained_variance | -0.000157 |
| learning_rate | 0.0001 |
| loss | 17 |
| n_updates | 580 |
| policy_gradient_loss | 1.17e-05 |
| value_loss | 24.5 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 523 |
| iterations | 60 |
| time_elapsed | 234 |
| total_timesteps | 122880 |
| train/ | |
| approx_kl | 9.938053e-05 |
| clip_fraction | 0.00142 |
| clip_range | 0.2 |
| entropy_loss | -0.0538 |
| explained_variance | 0.000101 |
| learning_rate | 0.0001 |
| loss | 10.7 |
| n_updates | 590 |
| policy_gradient_loss | -0.000246 |
| value_loss | 23.1 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -6.01e+03 |
| time/ | |
| fps | 525 |
| iterations | 61 |
| time_elapsed | 237 |
| total_timesteps | 124928 |
| train/ | |
| approx_kl | 0.00018916334 |
| clip_fraction | 0.00303 |
| clip_range | 0.2 |
| entropy_loss | -0.0446 |
| explained_variance | 0.000148 |
| learning_rate | 0.0001 |
| loss | 6.15 |
| n_updates | 600 |
| policy_gradient_loss | -0.00128 |
| value_loss | 27.6 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 524 |
| iterations | 62 |
| time_elapsed | 242 |
| total_timesteps | 126976 |
| train/ | |
| approx_kl | 4.523777e-05 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.0378 |
| explained_variance | 0.000176 |
| learning_rate | 0.0001 |
| loss | 11.4 |
| n_updates | 610 |
| policy_gradient_loss | -0.000497 |
| value_loss | 18.3 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 525 |
| iterations | 63 |
| time_elapsed | 245 |
| total_timesteps | 129024 |
| train/ | |
| approx_kl | 0.00026768591 |
| clip_fraction | 0.00347 |
| clip_range | 0.2 |
| entropy_loss | -0.0346 |
| explained_variance | 0.000441 |
| learning_rate | 0.0001 |
| loss | 11.5 |
| n_updates | 620 |
| policy_gradient_loss | -0.00139 |
| value_loss | 37.4 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 526 |
| iterations | 64 |
| time_elapsed | 249 |
| total_timesteps | 131072 |
| train/ | |
| approx_kl | 7.622526e-06 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.0292 |
| explained_variance | -9.62e-05 |
| learning_rate | 0.0001 |
| loss | 6 |
| n_updates | 630 |
| policy_gradient_loss | 0.000175 |
| value_loss | 8.07 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 527 |
| iterations | 65 |
| time_elapsed | 252 |
| total_timesteps | 133120 |
| train/ | |
| approx_kl | 2.6350608e-07 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.0311 |
| explained_variance | 0.000158 |
| learning_rate | 0.0001 |
| loss | 3.18 |
| n_updates | 640 |
| policy_gradient_loss | 0.000106 |
| value_loss | 33 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 525 |
| iterations | 66 |
| time_elapsed | 257 |
| total_timesteps | 135168 |
| train/ | |
| approx_kl | 9.358092e-05 |
| clip_fraction | 0.00083 |
| clip_range | 0.2 |
| entropy_loss | -0.0255 |
| explained_variance | 0.000128 |
| learning_rate | 0.0001 |
| loss | 1.01 |
| n_updates | 650 |
| policy_gradient_loss | -5.11e-05 |
| value_loss | 6.33 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 526 |
| iterations | 67 |
| time_elapsed | 260 |
| total_timesteps | 137216 |
| train/ | |
| approx_kl | 3.4828554e-06 |
| clip_fraction | 0.000684 |
| clip_range | 0.2 |
| entropy_loss | -0.0273 |
| explained_variance | -0.000236 |
| learning_rate | 0.0001 |
| loss | 7.62 |
| n_updates | 660 |
| policy_gradient_loss | -0.000176 |
| value_loss | 13.6 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 527 |
| iterations | 68 |
| time_elapsed | 263 |
| total_timesteps | 139264 |
| train/ | |
| approx_kl | 6.2496925e-05 |
| clip_fraction | 0.00117 |
| clip_range | 0.2 |
| entropy_loss | -0.0297 |
| explained_variance | 0.000403 |
| learning_rate | 0.0001 |
| loss | 6.63 |
| n_updates | 670 |
| policy_gradient_loss | -0.000472 |
| value_loss | 25.8 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 528 |
| iterations | 69 |
| time_elapsed | 267 |
| total_timesteps | 141312 |
| train/ | |
| approx_kl | 9.1235386e-05 |
| clip_fraction | 0.00249 |
| clip_range | 0.2 |
| entropy_loss | -0.0271 |
| explained_variance | 0.000398 |
| learning_rate | 0.0001 |
| loss | 3.11 |
| n_updates | 680 |
| policy_gradient_loss | -0.00177 |
| value_loss | 9.06 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 527 |
| iterations | 70 |
| time_elapsed | 271 |
| total_timesteps | 143360 |
| train/ | |
| approx_kl | 1.6124366e-05 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.0254 |
| explained_variance | -8.92e-05 |
| learning_rate | 0.0001 |
| loss | 15.6 |
| n_updates | 690 |
| policy_gradient_loss | 0.000162 |
| value_loss | 58.8 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 528 |
| iterations | 71 |
| time_elapsed | 275 |
| total_timesteps | 145408 |
| train/ | |
| approx_kl | 7.662439e-05 |
| clip_fraction | 0.00132 |
| clip_range | 0.2 |
| entropy_loss | -0.0272 |
| explained_variance | -0.000148 |
| learning_rate | 0.0001 |
| loss | 1.59 |
| n_updates | 700 |
| policy_gradient_loss | -0.00053 |
| value_loss | 16.8 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 529 |
| iterations | 72 |
| time_elapsed | 278 |
| total_timesteps | 147456 |
| train/ | |
| approx_kl | 0.00022278022 |
| clip_fraction | 0.00356 |
| clip_range | 0.2 |
| entropy_loss | -0.0312 |
| explained_variance | -0.000168 |
| learning_rate | 0.0001 |
| loss | 17.1 |
| n_updates | 710 |
| policy_gradient_loss | -0.00114 |
| value_loss | 39.8 |
-------------------------------------------
--------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 529 |
| iterations | 73 |
| time_elapsed | 282 |
| total_timesteps | 149504 |
| train/ | |
| approx_kl | 0.000118300784 |
| clip_fraction | 0.00132 |
| clip_range | 0.2 |
| entropy_loss | -0.0281 |
| explained_variance | 2.26e-05 |
| learning_rate | 0.0001 |
| loss | 7.17 |
| n_updates | 720 |
| policy_gradient_loss | -0.00115 |
| value_loss | 13.8 |
--------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 528 |
| iterations | 74 |
| time_elapsed | 286 |
| total_timesteps | 151552 |
| train/ | |
| approx_kl | 0.00014248138 |
| clip_fraction | 0.000488 |
| clip_range | 0.2 |
| entropy_loss | -0.0254 |
| explained_variance | 2.13e-05 |
| learning_rate | 0.0001 |
| loss | 7.64 |
| n_updates | 730 |
| policy_gradient_loss | -1.77e-05 |
| value_loss | 29.7 |
-------------------------------------------
--------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 529 |
| iterations | 75 |
| time_elapsed | 290 |
| total_timesteps | 153600 |
| train/ | |
| approx_kl | 0.000110096385 |
| clip_fraction | 0.00181 |
| clip_range | 0.2 |
| entropy_loss | -0.0253 |
| explained_variance | 0.000126 |
| learning_rate | 0.0001 |
| loss | 21.8 |
| n_updates | 740 |
| policy_gradient_loss | -0.000213 |
| value_loss | 37.2 |
--------------------------------------------
--------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -4.62e+03 |
| time/ | |
| fps | 530 |
| iterations | 76 |
| time_elapsed | 293 |
| total_timesteps | 155648 |
| train/ | |
| approx_kl | 0.000121777906 |
| clip_fraction | 0.00273 |
| clip_range | 0.2 |
| entropy_loss | -0.0207 |
| explained_variance | -0.000448 |
| learning_rate | 0.0001 |
| loss | 8.28 |
| n_updates | 750 |
| policy_gradient_loss | -0.00263 |
| value_loss | 20.6 |
--------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 529 |
| iterations | 77 |
| time_elapsed | 297 |
| total_timesteps | 157696 |
| train/ | |
| approx_kl | 6.8619556e-05 |
| clip_fraction | 0.000195 |
| clip_range | 0.2 |
| entropy_loss | -0.0181 |
| explained_variance | -0.000112 |
| learning_rate | 0.0001 |
| loss | 7.98 |
| n_updates | 760 |
| policy_gradient_loss | -0.000476 |
| value_loss | 22.8 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 529 |
| iterations | 78 |
| time_elapsed | 301 |
| total_timesteps | 159744 |
| train/ | |
| approx_kl | 9.807342e-05 |
| clip_fraction | 0.0019 |
| clip_range | 0.2 |
| entropy_loss | -0.0138 |
| explained_variance | -5.25e-06 |
| learning_rate | 0.0001 |
| loss | 15.6 |
| n_updates | 770 |
| policy_gradient_loss | -0.000991 |
| value_loss | 17.9 |
------------------------------------------
-----------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 530 |
| iterations | 79 |
| time_elapsed | 304 |
| total_timesteps | 161792 |
| train/ | |
| approx_kl | 9.18142e-05 |
| clip_fraction | 0.00146 |
| clip_range | 0.2 |
| entropy_loss | -0.0106 |
| explained_variance | -3.22e-05 |
| learning_rate | 0.0001 |
| loss | 0.29 |
| n_updates | 780 |
| policy_gradient_loss | -0.0015 |
| value_loss | 4.17 |
-----------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 531 |
| iterations | 80 |
| time_elapsed | 308 |
| total_timesteps | 163840 |
| train/ | |
| approx_kl | 1.1527591e-05 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00961 |
| explained_variance | -3.52e-05 |
| learning_rate | 0.0001 |
| loss | 6.74 |
| n_updates | 790 |
| policy_gradient_loss | 0.000161 |
| value_loss | 38.3 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 530 |
| iterations | 81 |
| time_elapsed | 312 |
| total_timesteps | 165888 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00841 |
| explained_variance | -0.0101 |
| learning_rate | 0.0001 |
| loss | 2.51e-05 |
| n_updates | 800 |
| policy_gradient_loss | -1.46e-06 |
| value_loss | 0.00084 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 531 |
| iterations | 82 |
| time_elapsed | 316 |
| total_timesteps | 167936 |
| train/ | |
| approx_kl | 4.3276086e-05 |
| clip_fraction | 0.000391 |
| clip_range | 0.2 |
| entropy_loss | -0.00715 |
| explained_variance | -4.86e-05 |
| learning_rate | 0.0001 |
| loss | 4.71 |
| n_updates | 810 |
| policy_gradient_loss | -0.000819 |
| value_loss | 14.5 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 532 |
| iterations | 83 |
| time_elapsed | 319 |
| total_timesteps | 169984 |
| train/ | |
| approx_kl | 4.4569257e-05 |
| clip_fraction | 0.00083 |
| clip_range | 0.2 |
| entropy_loss | -0.00736 |
| explained_variance | 1.37e-05 |
| learning_rate | 0.0001 |
| loss | 11.4 |
| n_updates | 820 |
| policy_gradient_loss | -0.00104 |
| value_loss | 42.1 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 532 |
| iterations | 84 |
| time_elapsed | 322 |
| total_timesteps | 172032 |
| train/ | |
| approx_kl | 2.831046e-05 |
| clip_fraction | 0.000684 |
| clip_range | 0.2 |
| entropy_loss | -0.00664 |
| explained_variance | 2.97e-05 |
| learning_rate | 0.0001 |
| loss | 1.17 |
| n_updates | 830 |
| policy_gradient_loss | -0.00103 |
| value_loss | 8.13 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 531 |
| iterations | 85 |
| time_elapsed | 327 |
| total_timesteps | 174080 |
| train/ | |
| approx_kl | 1.9206054e-05 |
| clip_fraction | 0.000293 |
| clip_range | 0.2 |
| entropy_loss | -0.00529 |
| explained_variance | -0.000153 |
| learning_rate | 0.0001 |
| loss | 0.0166 |
| n_updates | 840 |
| policy_gradient_loss | -0.000511 |
| value_loss | 2.89 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 532 |
| iterations | 86 |
| time_elapsed | 330 |
| total_timesteps | 176128 |
| train/ | |
| approx_kl | 2.6202237e-05 |
| clip_fraction | 0.000586 |
| clip_range | 0.2 |
| entropy_loss | -0.00562 |
| explained_variance | 9.54e-07 |
| learning_rate | 0.0001 |
| loss | 10.8 |
| n_updates | 850 |
| policy_gradient_loss | -0.000853 |
| value_loss | 42.8 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 533 |
| iterations | 87 |
| time_elapsed | 334 |
| total_timesteps | 178176 |
| train/ | |
| approx_kl | 3.6602287e-05 |
| clip_fraction | 0.000732 |
| clip_range | 0.2 |
| entropy_loss | -0.00648 |
| explained_variance | -2.83e-05 |
| learning_rate | 0.0001 |
| loss | 7.2 |
| n_updates | 860 |
| policy_gradient_loss | 0.000327 |
| value_loss | 17.5 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 533 |
| iterations | 88 |
| time_elapsed | 337 |
| total_timesteps | 180224 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00551 |
| explained_variance | -0.021 |
| learning_rate | 0.0001 |
| loss | 5.43e-05 |
| n_updates | 870 |
| policy_gradient_loss | 7.97e-07 |
| value_loss | 0.000302 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 532 |
| iterations | 89 |
| time_elapsed | 342 |
| total_timesteps | 182272 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00553 |
| explained_variance | 0.0466 |
| learning_rate | 0.0001 |
| loss | 1.15e-05 |
| n_updates | 880 |
| policy_gradient_loss | 6.93e-08 |
| value_loss | 0.000127 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 532 |
| iterations | 90 |
| time_elapsed | 345 |
| total_timesteps | 184320 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00553 |
| explained_variance | -0.0568 |
| learning_rate | 0.0001 |
| loss | 5.32e-06 |
| n_updates | 890 |
| policy_gradient_loss | -2.85e-07 |
| value_loss | 8.77e-05 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.72e+03 |
| time/ | |
| fps | 532 |
| iterations | 91 |
| time_elapsed | 350 |
| total_timesteps | 186368 |
| train/ | |
| approx_kl | 1.2390607e-05 |
| clip_fraction | 0.000391 |
| clip_range | 0.2 |
| entropy_loss | -0.00652 |
| explained_variance | -3.46e-06 |
| learning_rate | 0.0001 |
| loss | 1.75 |
| n_updates | 900 |
| policy_gradient_loss | -0.000633 |
| value_loss | 5.38 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.11e+03 |
| time/ | |
| fps | 531 |
| iterations | 92 |
| time_elapsed | 354 |
| total_timesteps | 188416 |
| train/ | |
| approx_kl | 1.8498919e-05 |
| clip_fraction | 0.000195 |
| clip_range | 0.2 |
| entropy_loss | -0.00572 |
| explained_variance | -1.51e-05 |
| learning_rate | 0.0001 |
| loss | 28.6 |
| n_updates | 910 |
| policy_gradient_loss | -0.000442 |
| value_loss | 17.8 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.11e+03 |
| time/ | |
| fps | 531 |
| iterations | 93 |
| time_elapsed | 358 |
| total_timesteps | 190464 |
| train/ | |
| approx_kl | 0.00013042055 |
| clip_fraction | 0.00122 |
| clip_range | 0.2 |
| entropy_loss | -0.0037 |
| explained_variance | 1.51e-05 |
| learning_rate | 0.0001 |
| loss | 16.1 |
| n_updates | 920 |
| policy_gradient_loss | -0.00135 |
| value_loss | 38.5 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -3.11e+03 |
| time/ | |
| fps | 532 |
| iterations | 94 |
| time_elapsed | 361 |
| total_timesteps | 192512 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00348 |
| explained_variance | -0.00777 |
| learning_rate | 0.0001 |
| loss | 9.08e-06 |
| n_updates | 930 |
| policy_gradient_loss | 3.2e-06 |
| value_loss | 0.000537 |
---------------------------------------
[... PPO training log condensed: iterations 95–160 (total_timesteps 194,560–327,680) omitted; the tables repeat the same format, with ep_rew_mean improving stepwise from -3.11e+03 through -2.63e+03, -2.29e+03, and -2.02e+03 to -1.82e+03 at ~530–537 fps, learning_rate fixed at 0.0001, clip_range 0.2, and approx_kl/clip_fraction near zero throughout. The final complete table follows. ...]
--------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.82e+03 |
| time/ | |
| fps | 534 |
| iterations | 161 |
| time_elapsed | 616 |
| total_timesteps | 329728 |
| train/ | |
| approx_kl | 1.23280915e-05 |
| clip_fraction | 0.000391 |
| clip_range | 0.2 |
| entropy_loss | -0.00202 |
| explained_variance | -2.38e-07 |
| learning_rate | 0.0001 |
| loss | 0.0169 |
| n_updates | 1600 |
| policy_gradient_loss | -0.000642 |
| value_loss | 5.44 |
--------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.82e+03 |
| time/ | |
| fps | 534 |
| iterations | 162 |
| time_elapsed | 620 |
| total_timesteps | 331776 |
| train/ | |
| approx_kl | 9.287556e-06 |
| clip_fraction | 0.000195 |
| clip_range | 0.2 |
| entropy_loss | -0.00222 |
| explained_variance | 4.17e-07 |
| learning_rate | 0.0001 |
| loss | 137 |
| n_updates | 1610 |
| policy_gradient_loss | -0.000442 |
| value_loss | 91.3 |
------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.82e+03 |
| time/ | |
| fps | 535 |
| iterations | 163 |
| time_elapsed | 623 |
| total_timesteps | 333824 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00233 |
| explained_variance | -0.000614 |
| learning_rate | 0.0001 |
| loss | 3.44e-05 |
| n_updates | 1620 |
| policy_gradient_loss | -2.58e-06 |
| value_loss | 0.000603 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.82e+03 |
| time/ | |
| fps | 535 |
| iterations | 164 |
| time_elapsed | 627 |
| total_timesteps | 335872 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00233 |
| explained_variance | 0.0193 |
| learning_rate | 0.0001 |
| loss | 1.06e-05 |
| n_updates | 1630 |
| policy_gradient_loss | 3.18e-06 |
| value_loss | 0.000369 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.82e+03 |
| time/ | |
| fps | 535 |
| iterations | 165 |
| time_elapsed | 631 |
| total_timesteps | 337920 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00233 |
| explained_variance | -0.0067 |
| learning_rate | 0.0001 |
| loss | 3.22e-05 |
| n_updates | 1640 |
| policy_gradient_loss | 7.41e-07 |
| value_loss | 0.000253 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.82e+03 |
| time/ | |
| fps | 535 |
| iterations | 166 |
| time_elapsed | 634 |
| total_timesteps | 339968 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00233 |
| explained_variance | 0.0204 |
| learning_rate | 0.0001 |
| loss | 2.68e-05 |
| n_updates | 1650 |
| policy_gradient_loss | -2.7e-06 |
| value_loss | 0.000169 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.82e+03 |
| time/ | |
| fps | 535 |
| iterations | 167 |
| time_elapsed | 638 |
| total_timesteps | 342016 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00233 |
| explained_variance | -0.00187 |
| learning_rate | 0.0001 |
| loss | 1.31e-05 |
| n_updates | 1660 |
| policy_gradient_loss | 3.85e-06 |
| value_loss | 0.00012 |
---------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.82e+03 |
| time/ | |
| fps | 536 |
| iterations | 168 |
| time_elapsed | 641 |
| total_timesteps | 344064 |
| train/ | |
| approx_kl | 3.119529e-05 |
| clip_fraction | 0.000439 |
| clip_range | 0.2 |
| entropy_loss | -0.00307 |
| explained_variance | 1.1e-05 |
| learning_rate | 0.0001 |
| loss | 0.145 |
| n_updates | 1670 |
| policy_gradient_loss | -0.000616 |
| value_loss | 0.763 |
------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 535 |
| iterations | 169 |
| time_elapsed | 645 |
| total_timesteps | 346112 |
| train/ | |
| approx_kl | 4.6224188e-05 |
| clip_fraction | 0.00083 |
| clip_range | 0.2 |
| entropy_loss | -0.00243 |
| explained_variance | -2.98e-06 |
| learning_rate | 0.0001 |
| loss | 6.02 |
| n_updates | 1680 |
| policy_gradient_loss | -0.00104 |
| value_loss | 17.8 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 170 |
| time_elapsed | 649 |
| total_timesteps | 348160 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00236 |
| explained_variance | 1.79e-07 |
| learning_rate | 0.0001 |
| loss | 8.69e-06 |
| n_updates | 1690 |
| policy_gradient_loss | -1.11e-06 |
| value_loss | 0.0985 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 171 |
| time_elapsed | 652 |
| total_timesteps | 350208 |
| train/ | |
| approx_kl | 2.7465954e-05 |
| clip_fraction | 0.000391 |
| clip_range | 0.2 |
| entropy_loss | -0.0018 |
| explained_variance | 5.36e-07 |
| learning_rate | 0.0001 |
| loss | 27.3 |
| n_updates | 1700 |
| policy_gradient_loss | -0.000603 |
| value_loss | 25.1 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 172 |
| time_elapsed | 656 |
| total_timesteps | 352256 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00172 |
| explained_variance | 0.000345 |
| learning_rate | 0.0001 |
| loss | 0.000121 |
| n_updates | 1710 |
| policy_gradient_loss | 2.76e-06 |
| value_loss | 0.00019 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 173 |
| time_elapsed | 660 |
| total_timesteps | 354304 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00172 |
| explained_variance | 0.000648 |
| learning_rate | 0.0001 |
| loss | -4.19e-05 |
| n_updates | 1720 |
| policy_gradient_loss | 9.38e-07 |
| value_loss | 0.000154 |
---------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 174 |
| time_elapsed | 664 |
| total_timesteps | 356352 |
| train/ | |
| approx_kl | 6.543053e-05 |
| clip_fraction | 0.000391 |
| clip_range | 0.2 |
| entropy_loss | -0.00252 |
| explained_variance | 1.37e-06 |
| learning_rate | 0.0001 |
| loss | 21.6 |
| n_updates | 1730 |
| policy_gradient_loss | -0.000657 |
| value_loss | 16.6 |
------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 175 |
| time_elapsed | 667 |
| total_timesteps | 358400 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00263 |
| explained_variance | 0.00698 |
| learning_rate | 0.0001 |
| loss | -1.17e-07 |
| n_updates | 1740 |
| policy_gradient_loss | 1.25e-06 |
| value_loss | 5.32e-06 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 176 |
| time_elapsed | 671 |
| total_timesteps | 360448 |
| train/ | |
| approx_kl | 1.4514662e-05 |
| clip_fraction | 0.000439 |
| clip_range | 0.2 |
| entropy_loss | -0.00319 |
| explained_variance | -2.38e-06 |
| learning_rate | 0.0001 |
| loss | 6.69 |
| n_updates | 1750 |
| policy_gradient_loss | -0.000651 |
| value_loss | 11.9 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 177 |
| time_elapsed | 675 |
| total_timesteps | 362496 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00324 |
| explained_variance | 0.00225 |
| learning_rate | 0.0001 |
| loss | -4.6e-05 |
| n_updates | 1760 |
| policy_gradient_loss | 5.42e-07 |
| value_loss | 9.61e-05 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 178 |
| time_elapsed | 679 |
| total_timesteps | 364544 |
| train/ | |
| approx_kl | 2.5826332e-05 |
| clip_fraction | 0.000342 |
| clip_range | 0.2 |
| entropy_loss | -0.00406 |
| explained_variance | 2.98e-07 |
| learning_rate | 0.0001 |
| loss | 59.5 |
| n_updates | 1770 |
| policy_gradient_loss | -0.000601 |
| value_loss | 110 |
-------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 537 |
| iterations | 179 |
| time_elapsed | 682 |
| total_timesteps | 366592 |
| train/ | |
| approx_kl | 4.826288e-06 |
| clip_fraction | 0.000342 |
| clip_range | 0.2 |
| entropy_loss | -0.0049 |
| explained_variance | 2.98e-07 |
| learning_rate | 0.0001 |
| loss | 3.72 |
| n_updates | 1780 |
| policy_gradient_loss | 0.000376 |
| value_loss | 4.62 |
------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 180 |
| time_elapsed | 686 |
| total_timesteps | 368640 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00457 |
| explained_variance | -0.000362 |
| learning_rate | 0.0001 |
| loss | 1.09e-05 |
| n_updates | 1790 |
| policy_gradient_loss | 2.38e-06 |
| value_loss | 0.000477 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 536 |
| iterations | 181 |
| time_elapsed | 690 |
| total_timesteps | 370688 |
| train/ | |
| approx_kl | 1.4492078e-05 |
| clip_fraction | 0.000391 |
| clip_range | 0.2 |
| entropy_loss | -0.00547 |
| explained_variance | -1.07e-06 |
| learning_rate | 0.0001 |
| loss | 0.0886 |
| n_updates | 1800 |
| policy_gradient_loss | -0.0006 |
| value_loss | 1.68 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 537 |
| iterations | 182 |
| time_elapsed | 694 |
| total_timesteps | 372736 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00561 |
| explained_variance | 0.000345 |
| learning_rate | 0.0001 |
| loss | 1.78e-05 |
| n_updates | 1810 |
| policy_gradient_loss | 2.3e-06 |
| value_loss | 0.000431 |
---------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.64e+03 |
| time/ | |
| fps | 537 |
| iterations | 183 |
| time_elapsed | 697 |
| total_timesteps | 374784 |
| train/ | |
| approx_kl | 1.821114e-05 |
| clip_fraction | 0.000293 |
| clip_range | 0.2 |
| entropy_loss | -0.00666 |
| explained_variance | 1.19e-06 |
| learning_rate | 0.0001 |
| loss | 0.103 |
| n_updates | 1820 |
| policy_gradient_loss | -0.000563 |
| value_loss | 17.7 |
------------------------------------------
------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 184 |
| time_elapsed | 702 |
| total_timesteps | 376832 |
| train/ | |
| approx_kl | 3.706559e-05 |
| clip_fraction | 0.000732 |
| clip_range | 0.2 |
| entropy_loss | -0.00573 |
| explained_variance | -1.55e-06 |
| learning_rate | 0.0001 |
| loss | 0.292 |
| n_updates | 1830 |
| policy_gradient_loss | -0.00108 |
| value_loss | 38.9 |
------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 185 |
| time_elapsed | 705 |
| total_timesteps | 378880 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00548 |
| explained_variance | 3.93e-06 |
| learning_rate | 0.0001 |
| loss | 4.15e-06 |
| n_updates | 1840 |
| policy_gradient_loss | -2.55e-08 |
| value_loss | 0.0985 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 537 |
| iterations | 186 |
| time_elapsed | 709 |
| total_timesteps | 380928 |
| train/ | |
| approx_kl | 1.4825695e-05 |
| clip_fraction | 0.000342 |
| clip_range | 0.2 |
| entropy_loss | -0.00454 |
| explained_variance | -2.38e-07 |
| learning_rate | 0.0001 |
| loss | 0.979 |
| n_updates | 1850 |
| policy_gradient_loss | -0.000546 |
| value_loss | 7.9 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 187 |
| time_elapsed | 713 |
| total_timesteps | 382976 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00437 |
| explained_variance | -0.000916 |
| learning_rate | 0.0001 |
| loss | 4.13e-06 |
| n_updates | 1860 |
| policy_gradient_loss | 8.82e-07 |
| value_loss | 2.76e-05 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 188 |
| time_elapsed | 717 |
| total_timesteps | 385024 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00437 |
| explained_variance | 0.00101 |
| learning_rate | 0.0001 |
| loss | -1.53e-06 |
| n_updates | 1870 |
| policy_gradient_loss | 1.18e-05 |
| value_loss | 2.06e-05 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 189 |
| time_elapsed | 721 |
| total_timesteps | 387072 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00437 |
| explained_variance | -0.00293 |
| learning_rate | 0.0001 |
| loss | 2.6e-06 |
| n_updates | 1880 |
| policy_gradient_loss | 1.29e-07 |
| value_loss | 1.38e-05 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 190 |
| time_elapsed | 724 |
| total_timesteps | 389120 |
| train/ | |
| approx_kl | 6.7442015e-05 |
| clip_fraction | 0.00083 |
| clip_range | 0.2 |
| entropy_loss | -0.00573 |
| explained_variance | -1.19e-06 |
| learning_rate | 0.0001 |
| loss | 15.1 |
| n_updates | 1890 |
| policy_gradient_loss | -0.00118 |
| value_loss | 28.3 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 191 |
| time_elapsed | 729 |
| total_timesteps | 391168 |
| train/ | |
| approx_kl | 2.5089976e-05 |
| clip_fraction | 0.000439 |
| clip_range | 0.2 |
| entropy_loss | -0.00527 |
| explained_variance | -2.38e-07 |
| learning_rate | 0.0001 |
| loss | 46.9 |
| n_updates | 1900 |
| policy_gradient_loss | 0.000191 |
| value_loss | 43.3 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 192 |
| time_elapsed | 733 |
| total_timesteps | 393216 |
| train/ | |
| approx_kl | 1.7539423e-05 |
| clip_fraction | 0.000146 |
| clip_range | 0.2 |
| entropy_loss | -0.00412 |
| explained_variance | -8.34e-07 |
| learning_rate | 0.0001 |
| loss | 36.9 |
| n_updates | 1910 |
| policy_gradient_loss | -0.000479 |
| value_loss | 34.6 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 193 |
| time_elapsed | 737 |
| total_timesteps | 395264 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00372 |
| explained_variance | 0.0017 |
| learning_rate | 0.0001 |
| loss | 3.3e-06 |
| n_updates | 1920 |
| policy_gradient_loss | 2.35e-06 |
| value_loss | 4.06e-05 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 535 |
| iterations | 194 |
| time_elapsed | 741 |
| total_timesteps | 397312 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00372 |
| explained_variance | 0.00269 |
| learning_rate | 0.0001 |
| loss | 1.85e-06 |
| n_updates | 1930 |
| policy_gradient_loss | 7.11e-07 |
| value_loss | 5.31e-06 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 535 |
| iterations | 195 |
| time_elapsed | 745 |
| total_timesteps | 399360 |
| train/ | |
| approx_kl | 1.8064398e-05 |
| clip_fraction | 0.000195 |
| clip_range | 0.2 |
| entropy_loss | -0.0043 |
| explained_variance | -1.19e-07 |
| learning_rate | 0.0001 |
| loss | 2.96 |
| n_updates | 1940 |
| policy_gradient_loss | -0.000466 |
| value_loss | 142 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 535 |
| iterations | 196 |
| time_elapsed | 748 |
| total_timesteps | 401408 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00468 |
| explained_variance | -0.00415 |
| learning_rate | 0.0001 |
| loss | 1.1e-05 |
| n_updates | 1950 |
| policy_gradient_loss | 4.35e-06 |
| value_loss | 0.00103 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 536 |
| iterations | 197 |
| time_elapsed | 752 |
| total_timesteps | 403456 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00468 |
| explained_variance | 0.0101 |
| learning_rate | 0.0001 |
| loss | 2.08e-05 |
| n_updates | 1960 |
| policy_gradient_loss | 1.99e-06 |
| value_loss | 0.000855 |
---------------------------------------
--------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.49e+03 |
| time/ | |
| fps | 535 |
| iterations | 198 |
| time_elapsed | 756 |
| total_timesteps | 405504 |
| train/ | |
| approx_kl | 1.08265085e-05 |
| clip_fraction | 0.000391 |
| clip_range | 0.2 |
| entropy_loss | -0.00551 |
| explained_variance | 3.7e-06 |
| learning_rate | 0.0001 |
| loss | 43.3 |
| n_updates | 1970 |
| policy_gradient_loss | -0.000571 |
| value_loss | 29.9 |
--------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 536 |
| iterations | 199 |
| time_elapsed | 760 |
| total_timesteps | 407552 |
| train/ | |
| approx_kl | 0.00017001922 |
| clip_fraction | 0.00166 |
| clip_range | 0.2 |
| entropy_loss | -0.00398 |
| explained_variance | -9.06e-06 |
| learning_rate | 0.0001 |
| loss | 37.8 |
| n_updates | 1980 |
| policy_gradient_loss | -0.00156 |
| value_loss | 30.6 |
-------------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 536 |
| iterations | 200 |
| time_elapsed | 763 |
| total_timesteps | 409600 |
| train/ | |
| approx_kl | 0.00010934289 |
| clip_fraction | 0.000781 |
| clip_range | 0.2 |
| entropy_loss | -0.00258 |
| explained_variance | 1.31e-06 |
| learning_rate | 0.0001 |
| loss | 0.443 |
| n_updates | 1990 |
| policy_gradient_loss | -0.00105 |
| value_loss | 32.5 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 536 |
| iterations | 201 |
| time_elapsed | 767 |
| total_timesteps | 411648 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00239 |
| explained_variance | 0.000204 |
| learning_rate | 0.0001 |
| loss | 7.31e-07 |
| n_updates | 2000 |
| policy_gradient_loss | -2.77e-06 |
| value_loss | 0.000232 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 536 |
| iterations | 202 |
| time_elapsed | 771 |
| total_timesteps | 413696 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00239 |
| explained_variance | -0.000301 |
| learning_rate | 0.0001 |
| loss | -2.2e-07 |
| n_updates | 2010 |
| policy_gradient_loss | 7.38e-06 |
| value_loss | 0.000149 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 536 |
| iterations | 203 |
| time_elapsed | 774 |
| total_timesteps | 415744 |
| train/ | |
| approx_kl | 1.1383498e-05 |
| clip_fraction | 0.000342 |
| clip_range | 0.2 |
| entropy_loss | -0.00278 |
| explained_variance | 4.17e-07 |
| learning_rate | 0.0001 |
| loss | 0.449 |
| n_updates | 2020 |
| policy_gradient_loss | -0.000586 |
| value_loss | 7.9 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 536 |
| iterations | 204 |
| time_elapsed | 778 |
| total_timesteps | 417792 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00287 |
| explained_variance | 0.0112 |
| learning_rate | 0.0001 |
| loss | 3.25e-07 |
| n_updates | 2030 |
| policy_gradient_loss | -2.34e-07 |
| value_loss | 4.25e-06 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 536 |
| iterations | 205 |
| time_elapsed | 782 |
| total_timesteps | 419840 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00287 |
| explained_variance | -0.0606 |
| learning_rate | 0.0001 |
| loss | -5.71e-07 |
| n_updates | 2040 |
| policy_gradient_loss | 1.03e-06 |
| value_loss | 9.75e-07 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 536 |
| iterations | 206 |
| time_elapsed | 785 |
| total_timesteps | 421888 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00287 |
| explained_variance | 0.0198 |
| learning_rate | 0.0001 |
| loss | -8.28e-07 |
| n_updates | 2050 |
| policy_gradient_loss | 5.9e-08 |
| value_loss | 6.56e-07 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 537 |
| iterations | 207 |
| time_elapsed | 789 |
| total_timesteps | 423936 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00287 |
| explained_variance | -0.0132 |
| learning_rate | 0.0001 |
| loss | 7.65e-07 |
| n_updates | 2060 |
| policy_gradient_loss | -5.79e-07 |
| value_loss | 4.49e-07 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 537 |
| iterations | 208 |
| time_elapsed | 792 |
| total_timesteps | 425984 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00287 |
| explained_variance | -0.0392 |
| learning_rate | 0.0001 |
| loss | 6.12e-08 |
| n_updates | 2070 |
| policy_gradient_loss | -1.29e-07 |
| value_loss | 3.07e-07 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 537 |
| iterations | 209 |
| time_elapsed | 796 |
| total_timesteps | 428032 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00287 |
| explained_variance | 0.00864 |
| learning_rate | 0.0001 |
| loss | 9.23e-06 |
| n_updates | 2080 |
| policy_gradient_loss | 8.09e-07 |
| value_loss | 2.08e-07 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 537 |
| iterations | 210 |
| time_elapsed | 800 |
| total_timesteps | 430080 |
| train/ | |
| approx_kl | 1.9878382e-05 |
| clip_fraction | 0.000342 |
| clip_range | 0.2 |
| entropy_loss | -0.00351 |
| explained_variance | 3.58e-07 |
| learning_rate | 0.0001 |
| loss | 449 |
| n_updates | 2090 |
| policy_gradient_loss | -0.000586 |
| value_loss | 233 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 537 |
| iterations | 211 |
| time_elapsed | 803 |
| total_timesteps | 432128 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00366 |
| explained_variance | -0.00172 |
| learning_rate | 0.0001 |
| loss | 8.65e-06 |
| n_updates | 2100 |
| policy_gradient_loss | -5.76e-06 |
| value_loss | 0.00165 |
---------------------------------------
-------------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 538 |
| iterations | 212 |
| time_elapsed | 806 |
| total_timesteps | 434176 |
| train/ | |
| approx_kl | 3.5479752e-05 |
| clip_fraction | 0.000439 |
| clip_range | 0.2 |
| entropy_loss | -0.00265 |
| explained_variance | -2.72e-05 |
| learning_rate | 0.0001 |
| loss | 0.000888 |
| n_updates | 2110 |
| policy_gradient_loss | -0.000651 |
| value_loss | 0.35 |
-------------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 537 |
| iterations | 213 |
| time_elapsed | 811 |
| total_timesteps | 436224 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00255 |
| explained_variance | -0.00118 |
| learning_rate | 0.0001 |
| loss | 0.000174 |
| n_updates | 2120 |
| policy_gradient_loss | 3.15e-06 |
| value_loss | 0.000572 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.37e+03 |
| time/ | |
| fps | 538 |
| iterations | 214 |
| time_elapsed | 814 |
| total_timesteps | 438272 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00255 |
| explained_variance | -0.00662 |
| learning_rate | 0.0001 |
| loss | -3.8e-06 |
| n_updates | 2130 |
| policy_gradient_loss | 1.41e-07 |
| value_loss | 0.000397 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.26e+03 |
| time/ | |
| fps | 538 |
| iterations | 215 |
| time_elapsed | 817 |
| total_timesteps | 440320 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00255 |
| explained_variance | -0.00302 |
| learning_rate | 0.0001 |
| loss | 6.7e-05 |
| n_updates | 2140 |
| policy_gradient_loss | 1.03e-06 |
| value_loss | 0.000258 |
---------------------------------------
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.26e+03 |
| time/ | |
| fps | 538 |
| iterations | 216 |
| time_elapsed | 821 |
| total_timesteps | 442368 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00255 |
| explained_variance | 4.77e-07 |
| learning_rate | 0.0001 |
| loss | 3.4e-05 |
| n_updates | 2150 |
| policy_gradient_loss | 4.18e-07 |
| value_loss | 0.0988 |
---------------------------------------
[... 28 similar PPO training-log tables (iterations 217-244, timesteps 444,416-499,712) omitted; over this span ep_rew_mean improves from -1.26e+03 to -1.16e+03 ...]
---------------------------------------
| rollout/ | |
| ep_len_mean | 3.13e+04 |
| ep_rew_mean | -1.08e+03 |
| time/ | |
| fps | 540 |
| iterations | 245 |
| time_elapsed | 928 |
| total_timesteps | 501760 |
| train/ | |
| approx_kl | 0.0 |
| clip_fraction | 0 |
| clip_range | 0.2 |
| entropy_loss | -0.00268 |
| explained_variance | -0.00652 |
| learning_rate | 0.0001 |
| loss | 0.000133 |
| n_updates | 2440 |
| policy_gradient_loss | 3.24e-07 |
| value_loss | 0.000636 |
---------------------------------------
# Initialize counters
buy_trades = 0
sell_trades = 0
previous_action = None
for trade in my_forex_env.trade_log:
current_action = trade['action']
if current_action != previous_action:
if current_action == Actions.Buy.value:
buy_trades += 1
elif current_action == Actions.Sell.value:
sell_trades += 1
previous_action = current_action
print(f"Buy trades executed: {buy_trades}")
print(f"Sell trades executed: {sell_trades}")
import matplotlib.pyplot as plt
buy_steps = [log['step'] for log in my_forex_env.trade_log if log['action'] == Actions.Buy.value]
sell_steps = [log['step'] for log in my_forex_env.trade_log if log['action'] == Actions.Sell.value]
prices = my_forex_env.prices
plt.figure(figsize=(14, 7))
# Plot a window of the price series with x-values equal to the absolute step
# indices, so the buy/sell markers line up with the curve.
window = range(180, 360)
plt.plot(window, prices[180:360], label='Price', color='gray', linewidth=1, alpha=0.9)
buy_in_window = [s for s in buy_steps if 180 <= s < 360]
sell_in_window = [s for s in sell_steps if 180 <= s < 360]
plt.scatter(buy_in_window, [prices[s] for s in buy_in_window], color='green', label='Buy', marker='^', s=20, alpha=0.7)
plt.scatter(sell_in_window, [prices[s] for s in sell_in_window], color='red', label='Sell', marker='v', s=20, alpha=0.7)
plt.title('Trading Strategy Execution Over Time')
plt.xlabel('Steps')
plt.ylabel('Price')
plt.legend(loc='best')
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
action_stats = {Actions.Sell: 0, Actions.Buy: 0}
#model = PPO.load("ppo_forex")
total_rewards = []
actions_taken = []
obs, info = my_forex_env.reset()
key_step = 0
while True:
action, _states = model.predict(obs)
action_stats[Actions(action)] += 1
action_value = action.item() if isinstance(action, np.ndarray) and action.ndim == 0 else action
actions_taken.append((key_step, action_value)) # Store the step and action value
obs, rewards, terminated, truncated, info = my_forex_env.step(action)
dones = terminated or truncated
total_rewards.append(info['total_reward'])
key_step +=1
if dones:
break # Stop the loop if done
print("action_stats:", action_stats)
print("info:", info)
action_stats: {<Actions.Sell: 0>: 10, <Actions.Buy: 1>: 31316}
info: {'total_reward': 219.88590216636658, 'total_profit': 0.9955382712446159, 'position': <Positions.Long: 1>}
Profit and reward¶
plt.figure(figsize=(15, 6))
plt.plot(my_forex_env.history['total_profit'], label='Total profit', color='gray', alpha=0.5)
[<matplotlib.lines.Line2D at 0x79ae3b76e1a0>]
plt.figure(figsize=(15, 6))
plt.plot(my_forex_env.history['total_reward'], label='Total reward', color='gray', alpha=0.5)
[<matplotlib.lines.Line2D at 0x79ae5eb4e7a0>]
Test data¶
window_size = 180
start_date = "2017-12-01"
end_date = "2017-12-31"
filtered_data2 = df2[start_date:end_date]
start_index = window_size
end_index = len(filtered_data2)
eval_env = MyStocksEnv(df=filtered_data2, window_size=window_size, frame_bound=(start_index, end_index))
(SMA, MACD, RSI) weren't all present. Reverting to default signal features (price and price-diff)
action_stats_eval = {Actions.Sell: 0, Actions.Buy: 0}
total_rewards_eval = []
actions_taken_eval = []
obs, info = eval_env.reset()
key_step = 0
while True:
action, _states = model.predict(obs)
action_stats_eval[Actions(action)] += 1
action_value = action.item() if isinstance(action, np.ndarray) and action.ndim == 0 else action
actions_taken_eval.append((key_step, action_value)) # Store the step and action value
obs, rewards, terminated, truncated, info = eval_env.step(action)
dones = terminated or truncated
total_rewards_eval.append(info['total_reward'])
key_step +=1
if dones:
break
print("action_stats:", action_stats_eval)
print("info:", info)
action_stats: {<Actions.Sell: 0>: 14, <Actions.Buy: 1>: 26895}
info: {'total_reward': 34.86785876750946, 'total_profit': 0.9964572716774631, 'position': <Positions.Long: 1>}
# Initialize counters
buy_trades_eval = 0
sell_trades_eval = 0
previous_action = None
for trade in eval_env.trade_log:
current_action = trade['action']
if current_action != previous_action:
# This ensures we're only counting actions that result in a trade (position change)
if current_action == Actions.Buy.value:
buy_trades_eval += 1
elif current_action == Actions.Sell.value:
sell_trades_eval += 1
previous_action = current_action
print(f"Buy trades executed: {buy_trades_eval}")
print(f"Sell trades executed: {sell_trades_eval}")
Profit and reward test¶
plt.plot(eval_env.history['total_profit'], label='Total profit', color='gray', alpha=0.5)
[<matplotlib.lines.Line2D at 0x79ae3a118790>]
plt.plot(eval_env.history['total_reward'], label='Total reward', color='gray', alpha=0.5)
[<matplotlib.lines.Line2D at 0x79ae3a17be20>]
prices_eval = list(filtered_data2.Close.values)
# The environment's first tradable tick sits at index `window_size` of the
# filtered data, while key_step counts from 0, so shift the step indices
# before indexing into the raw price series.
buy_steps_eval = [key_step + window_size for key_step, action in actions_taken_eval if action == Actions.Buy.value]
sell_steps_eval = [key_step + window_size for key_step, action in actions_taken_eval if action == Actions.Sell.value]
# Extract the prices for the buy and sell steps
buy_prices_eval = [prices_eval[step] for step in buy_steps_eval]
sell_prices_eval = [prices_eval[step] for step in sell_steps_eval]
plt.figure(figsize=(15, 6))
plt.plot(prices_eval, label='Price', color='gray', alpha=0.5)
plt.scatter(sell_steps_eval, sell_prices_eval, label='Out of position', color='red', marker='v', alpha=0.7, s = 10)
plt.scatter(buy_steps_eval, buy_prices_eval, label='In position', color='green', marker='^', alpha=0.7, s=10)
plt.title('Trading Performance with Buy and Sell Actions')
plt.xlabel('Step')
plt.ylabel('Price')
plt.legend()
plt.show()
plt.figure(figsize=(15, 6))
plt.plot(total_rewards_eval, label='Total Rewards', color='blue')
plt.title('Total Rewards Over Time')
plt.xlabel('Step')
plt.ylabel('Total Reward')
plt.legend()
plt.show()
PPO Results¶
During training the model learns quickly at first; around step 10,000 the reward stabilizes and begins to recover from its initial fall. This traces how the model learns to buy and sell profitably: early on it takes actions that lead to lower returns, and it is penalized for those unprofitable decisions.
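The stabilisation point around step 10,000 is easier to read off if the reward series is smoothed before plotting. A minimal sketch, using a synthetic series in place of the `total_rewards` list collected in the rollout loop (the window length of 500 is an arbitrary choice, not from this notebook):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the per-step `total_rewards` series.
rng = np.random.default_rng(0)
total_rewards = list(np.cumsum(rng.normal(0.0, 1.0, 31_000)))

# A rolling mean removes per-step noise so the trend, and the point where the
# reward stops falling and starts to recover, stands out in the plot.
smoothed = pd.Series(total_rewards).rolling(window=500, min_periods=1).mean()
```

Plotting `smoothed` in place of the raw series keeps the same x-axis (steps) while suppressing the high-frequency jitter.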
action_stats_train: {<Actions.Sell: 0>: 10, <Actions.Buy: 1>: 31316}
action_stats_test: {<Actions.Sell: 0>: 14, <Actions.Buy: 1>: 26895}
The behaviour of the trading agent is rather cautious: most of the time it holds what it has bought, waiting for a better moment to sell. By gym_anytrading's design there are only two possible actions, so when the statistics report 31,316 occurrences of Actions.Buy during training, that covers both the active action of buying and the passive action of holding, while Actions.Sell means either selling or, when no assets are held, simply staying out of position. Unlike the more advanced environment proposed earlier, this environment does not allow short-selling.
The total reward obtained during training is 219.886, positive by the end despite being negative at the start, which is evidence that the model does learn. Even so, when the model is fitted to the whole month it is almost impossible to end with a net profit. When the time frame is shortened to half a month, with proper parametrization a small profit is achievable after all. This suggests these algorithms can learn the market mechanics over shorter horizons, especially on high-frequency (minute) data.
On the training data the total profit equals 0.9955: the trading agent did not make money, but it also did not lose much, just 0.45%. On the test set, which covers December, a month with very different market behaviour, the model (fitted to March) still limited the loss to 0.35%.
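Since gym-anytrading reports `total_profit` as a multiplicative factor (1.0 means break-even), the percentage figures above follow from a one-line conversion; `pct_return` is our own helper here, not part of the library:

```python
def pct_return(total_profit: float) -> float:
    """Convert gym-anytrading's multiplicative total_profit into a percent return."""
    return (total_profit - 1.0) * 100.0

print(f"train: {pct_return(0.9955):+.2f}%")  # -0.45%
print(f"test:  {pct_return(0.9965):+.2f}%")  # -0.35%
```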
Although the test profit compares favourably with training, the total test reward is lower (34.87). The model evidently struggles when it sees unfamiliar data from a month with different trading behaviour: the plot shows the reward decreasing until around step 15,000, reaching -200, which is much larger in absolute value than anything seen in training, and only after step 15,000 does the model start to gain reward. It takes time for the model to adjust to new, specific data.
In conclusion, this study recommends using the PPO model over a shorter time frame for Forex trading. PPO typically needs fewer timesteps to fit than A2C. Furthermore, it is advisable to store different models for different market regimes and use them in an ensemble.
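One way to realise the ensemble recommendation is to route each prediction to a regime-specific model. The sketch below assumes separate models have already been trained per regime; the `realized_vol` helper, the volatility threshold, and the `'calm'`/`'volatile'` keys are all hypothetical choices for illustration, not part of this notebook:

```python
import numpy as np

def realized_vol(close, lookback=60):
    """Standard deviation of log returns over the trailing window."""
    rets = np.diff(np.log(np.asarray(close[-lookback:], dtype=float)))
    return float(rets.std())

def pick_model(models, close, threshold=1e-3):
    """Route to a pre-trained model by recent volatility.

    `models` maps regime names to loaded models, e.g.
    {'calm': PPO.load(...), 'volatile': PPO.load(...)} (hypothetical layout).
    """
    regime = 'volatile' if realized_vol(close) > threshold else 'calm'
    return models[regime]
```

At prediction time one would call `pick_model(models, prices_so_far).predict(obs)` in place of a single fixed model, so each month's trading behaviour is handled by the model trained on a similar regime.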
Conclusion¶
In the first half of this project we compared the performance of A2C, PPO, and DQN algorithms trading on simulated stocks, with prices characterized by sinusoidal waves with varying levels of noise. In Section 1.5.3, we discussed the evolution of model performance as the signals approach white noise. We also compared the performance of these DRL models against a classical benchmark strategy based on mean reversion. At low noise levels, the mean reversion benchmark outperformed all algorithms, but this advantage weakened as noise increased. DQN and A2C performed notably well, even after accounting for volatility. However, in the test set, DQN became unprofitable at high noise values, while A2C remained profitable. Further exploration is needed to better understand A2C's performance on noisy signals.
In the second half of the project, we explored DRL for Forex trading. Implementing DQN architectures in a customized trading environment revealed insights into the trading agents' behaviour. The enriched action space and position dynamics of the custom environment allow the DQN models to execute more complex strategies; by letting the agent short-sell, it approximates real-world trading more closely than simpler environments. The results suggest that as the DQN is trained for more timesteps, the action distribution tends to diversify, supporting the hypothesis that extended market exposure enhances strategic depth and adaptability. This adaptation in turn mitigates the risk of overfitting to narrow market trends, fostering a robustness that matters greatly in the volatile Forex environment.
The different DQN architectures appear to embody different trading philosophies, from conservative to aggressive strategies, mirroring real-world trading styles, as seen in Section 2.1.4. Larger, more complex networks (such as the DeepMind architecture with a high neuron count) seem to capture subtle market nuances better, though at higher risk, whereas simpler networks prioritize stability and consistent returns over high-risk, high-reward trades. Applying both A2C and PPO to Forex trading shows that each model adapts differently to the conditions of two months with very different trading behaviour. A2C tends to need more time to stabilize, indicating a need for longer training periods, while PPO adapts more quickly to market changes, suggesting it may be more effective over shorter time frames. Both models initially display a conservative, risk-averse strategy that prioritizes waiting for good conditions over frequent trading. These results emphasize the potential benefit of combining models to handle diverse market behaviours and strengthen trading strategies.
Bibliography¶
Baradja, A., Gernowo, R., and Wibowo, A. (2023): "Optimizing Advantage Actor-Critic with Policy Gradient and Deep Q-learning to Maximize Profit in Forex Trading Prediction." Presented at the 2023 1st IEEE International Conference on Smart Technology (ICE-SMARTec). Available at: DOI.
Briola, A. et al. (2023): "Deep Reinforcement Learning for Active High Frequency Trading." Available on arXiv: DOI.
Carapuço, J., Neves, R., and Horta, N. (2018): "Reinforcement learning applied to Forex trading." Published in Applied Soft Computing, 73, pp. 783–794. Available at: DOI.
Deng, Y. et al. (2017): "Deep Direct Reinforcement Learning for Financial Signal Representation and Trading." Published in IEEE Transactions on Neural Networks and Learning Systems, 28(3), pp. 653–664. Available at: DOI.
Ganesh, P., and Rakheja, P. (2018): "Deep Reinforcement Learning in High Frequency Trading."
Haghpanah, M.A. (2024): Repository: "AminHP/gym-anytrading." Available at: GitHub (Accessed: 1 April 2024).
Huang, S. et al. (2024): "Open RL Benchmark: Comprehensive Tracked Experiments for Reinforcement Learning." Available on arXiv: arXiv link (Accessed: 11 March 2024).
Jaddu, K.S., and Bilokon, P.A. (2023): "Combining Deep Learning on Order Books with Reinforcement Learning for Profitable Trading." Available on arXiv: arXiv link (Accessed: 27 February 2024).
Liu, X.-Y., Yang, H., et al. (2022): "FinRL: deep reinforcement learning framework to automate trading in quantitative finance." Presented at the Second ACM International Conference on AI in Finance (ICAIF ’21). Available at: DOI.
Liu, X.-Y., Xiong, Z., et al. (2022): "Practical Deep Reinforcement Learning Approach for Stock Trading." Available on arXiv: arXiv link (Accessed: 29 January 2024).
Mnih, V. et al. (2013): "Playing Atari with Deep Reinforcement Learning." Available on arXiv: DOI.
Mnih, V. et al. (2016): "Asynchronous Methods for Deep Reinforcement Learning." Available on arXiv: arXiv link (Accessed: 31 January 2024).
Moody, J., and Saffell, M. (1998): "Reinforcement Learning for Trading." Presented at Advances in Neural Information Processing Systems. Available at: Abstract (Accessed: 9 May 2024).
Nevmyvaka, Y., Feng, Y., and Kearns, M. (2006): "Reinforcement learning for optimized trade execution." Presented at the 23rd International Conference on Machine Learning (ICML ’06), pp. 673–680. Available at: DOI.
Schulman, J. et al. (2017): "Proximal Policy Optimization Algorithms." Available on arXiv: DOI.
Taherizadeh, A., and Zamani, S. (2023): "Winner Strategies in a Simulated Stock Market." Published in the International Journal of Financial Studies, 11(2), p. 73. Available at: DOI.
Tsai, Y.-C. et al. (2020): "Deep Reinforcement Learning for Foreign Exchange Trading." In Trends in Artificial Intelligence Theory and Applications, edited by H. Fujita et al. (Artificial Intelligence Practices). Cham: Springer International Publishing, pp. 387–392. Available at: DOI.
Zhang, Z., Zohren, S., and Roberts, S. (2019): "Deep Reinforcement Learning for Trading." Available on arXiv: DOI.
Contributions¶
33.33% each